Abstract

Since Chen's Entity-Relationship (ER) model, conceptual modeling has been playing a 
fundamental role in relational data design. In this paper we consider an extended ER 
(EER) model enriched with cardinality constraints, disjointness assertions, and is-a re- 
lations among both entities and relationships. In this setting, we consider the case of 
incomplete data, which is likely to occur, for instance, when data from different sources 
are integrated. In such a context, we address the problem of providing correct answers to 
conjunctive queries by reasoning on the schema. Based on previous results about decidabil- 
ity of the problem, we provide a query answering algorithm that performs rewriting of the 
initial query into a recursive Datalog query encoding the information about the schema. 
We finally show extensions to more general settings. This paper will appear in the special 
issue of Theory and Practice of Logic Programming (TPLP) titled Logic Programming in 
Databases: From Datalog to Semantic-Web Rules.

1 Introduction

Conceptual data models, and in particular the Entity-Relationship (ER)
model (Chen 1976), have long been playing a fundamental role in database design.
With the emerging trends in data exchange, information integration, semantic web, 
and web information systems, the need for dealing with inconsistent and incomplete 
data has arisen. In this context, it is important to provide correct answers to queries 
posed over inconsistent and incomplete data (Arenas et al. 1999). It is worth noticing
here that inconsistency and incompleteness of data are considered with respect
to a set of constraints (a.k.a. data dependencies). Such constraints, rather than 
expressing properties that hold on the data, are used to represent properties of the 
domain of interest. 

We address the problem of answering queries over incomplete data, where queries
are conjunctive queries expressed over particular relational schemata, called conceptual
schemata, that are derived from conceptual models. As for the conceptual
models, we follow (Chen 1976) and adopt an extension of the well-known
Entity-Relationship model, which we call the Extended Entity-Relationship (EER)
Model, in line with (Thalheim 2000) and the many variants of the classical ER
Model. Such an extension is widely adopted in practice and is able to repre- 
sent classes of objects with their attributes, relationships among classes, cardi- 
nality constraints in the participation of entities in relationships, and is-a rela- 
tions among both classes and relationships. We provide a formal semantics to our 
conceptual model in terms of the relational database model, similarly to what is 
done in (Markowitz and Makowsky 1990). This allows us to formulate conjunctive
queries over EER schemata. We do this by providing a translation from EER into 
relational, whose purpose is to obtain a precise characterization of the relational 
dependencies that are derived from an EER schema in a design process. 

In the presence of data that are incomplete w.r.t. a set of constraints, we
need to reason about the dependencies in order to provide certain answers; we do
this in a model-theoretic fashion, following the approach of (Arenas et al. 1999;
Calì et al. 2001). Intuitively, we start from a given, incomplete database for the
relational schema associated with the EER schema; such data, together with the 
constraints, are interpreted as a logical theory, with a (possibly infinite) set of mod- 
els, also called solutions in the literature. We adopt the so-called sound semantics 
(see, e.g., (Calì et al. 2003a)): a database is a model if it is a superset of the initial
data, and satisfies the constraints. Given a query, the certain answers are those that 
are true in all models. 

In this paper we address the problem of answering conjunctive queries over 
schemata derived from EER schemata in the presence of incomplete data with 
respect to the schema under the sound semantics. We present an algorithm, based 
on encoding the information about the conceptual schema and the instance into 
a rewriting of the conjunctive query in Datalog, which computes the certain an- 
swers to queries posed in such a context. The algorithm reasons on the integrity 
constraints and the query. 

The problem at hand can be sketchily stated as follows. 

• We have a conceptual EER schema. From it, a relational schema S is obtained
through a translation mechanism that also produces a set of integrity
constraints $\Sigma$ consisting of key and inclusion dependencies.

• We also have an instance D for S. D may be inconsistent with respect to $\Sigma$
and incomplete.

• Consider all the S-instances that extend D and satisfy $\Sigma$. The certain answers
to a conjunctive query Q over S are those that are true of all those instances.

• The problem is how to compute the certain answers to Q. 

• The solution we propose is to translate Q into a new query Q* and pose it 
to D. The answers to Q* are the certain answers to Q. 

More specifically, our contribution is summarized as follows. 
(a) We define a class of relational dependencies, which we call conceptual dependencies
(CDs), that is able to represent EER schemata; our class is constituted
by a subset of the well-known key dependencies (KDs) and inclusion dependencies
(IDs). A broad class of KDs and IDs for which the query answering
problem under incomplete data is known to be decidable is the class of KDs
(at most one per relational predicate) and non-key-conflicting inclusion dependencies
(NKCIDs), which was introduced in (Calì et al. 2003a). The problem of
answering queries over incomplete data under general KDs and IDs is known to be
undecidable (Calì et al. 2003a).
(b) We tackle the problem of query answering under CDs in the presence of incomplete
information, under the sound semantics. After reviewing how, also under
CDs, the chase is a useful tool for query answering, we solve the problem by
means of query rewriting, in the same fashion as in (Calì et al. 2003b), where a
rewriting for KDs and NKCIDs is presented. We show an algorithm that, given 
a query, rewrites it into another one that encodes relevant information about 
the relational constraints, so that the evaluation of the rewritten query over 
the initial incomplete data returns the certain answers. The rewritten query is 
in (positive) Datalog. 
Note that the chase (which we, however, do not construct in our query answering 
technique) is a conceptual tool whose construction amounts to repairing violations 
of IDs and KDs, the former by adding tuples, and the latter by merging tuples. 
However, repairing is not always possible, and in such cases the chase does not exist 
and query answering becomes trivial. In such cases the repair would require tuple 
deletions: this is captured by semantics such as those in (Bertossi and Bravo 2005;
Calì et al. 2003b).

It is important to notice that the class of CDs does not fall into the class of
KDs and NKCIDs. A strong indication (though there is no formal proof) of the
decidability, that we show in this paper, of the query answering problem under
CDs (and under the sound semantics) is found in (Calvanese et al. 1998), where it
is shown that query containment in a description logic, capable of representing EER
schemata, is decidable. However, the technique of (Calvanese et al. 1998) does not
give any indication on the algorithm that may be used to check containment (or, in 
our case, to answer queries). Differently, our technique gives a direct tool for query 
answering that, under certain conditions on the data, provides a low computational 
complexity with respect to the size of the data. 

This paper extends the work in (Calì 2007) and is organized as follows. We give
necessary preliminaries in Section 2; we introduce the EER model in Section 3;
in Section 4 we show how to answer queries with the chase, a formal tool to deal
with dependencies; the query rewriting technique is described in Section 5, together with
extensions to more general cases. Section 6 concludes the paper, discussing related
work.

2 Preliminaries and notation 

In this section we give a formal definition of the relational data model, database 
constraints, conjunctive queries and answers to queries on incomplete data. 

In the relational data model (Codd 1970), predicate symbols are used to denote






A. Calì and D. Martinenghi



the relations in the database, whereas constant symbols denote the objects and 
the values stored in relations. We assume to have two distinct, fixed and infinite 
alphabets $\Gamma_f$ and $\Gamma$ of fresh constants and non-fresh constants respectively, and we
consider only databases over $\Gamma \cup \Gamma_f$. We note that fresh constants are introduced as
a technical construct that allows us to build some representatives of databases, as
will be explained when introducing the chase. In particular, fresh constants are similar
to labeled nulls (Fagin et al. 2005) in that they allow representing existentially
quantified variables and will thus later be associated with Skolem terms. Indeed,
fresh constants play a role analogous to that of Skolem terms. For non-fresh constants,
which represent the proper constants of the universe, we adopt the so-called
unique name assumption, i.e., we assume that different non-fresh constants denote
different objects. Instead, fresh constants can be thought of as placeholders for
non-fresh constants. Therefore, distinct fresh constants can also represent the same
object. Furthermore, we shall make use of variables from a set $\Gamma_V$.

A relational schema $\mathcal{R}$ consists of an alphabet of predicate (or relation) symbols,
each with an associated arity denoting the number of arguments of the predicate
(or attributes of the relation). When a relation symbol r has arity n, it can be
denoted by r/n; in general, the arity of r can also be indicated by arity(r).

A relational database (or simply database) D over a schema $\mathcal{R}$ is a set of relations
with constants as atomic values. We have one relation of arity n for each predicate
symbol of arity n in the alphabet $\mathcal{R}$. The relation $r^D$ in D corresponding to the
predicate symbol r consists of a set of tuples of constants, that are the tuples
satisfying the predicate r in D.

When, given a database D for a schema $\mathcal{R}$, a tuple $t = (c_1, \ldots, c_n)$ is in $r^D$,
where $r \in \mathcal{R}$, we say that the fact $r(c_1, \ldots, c_n)$ holds in D. Henceforth, we will
use the notions of fact and tuple interchangeably.

Integrity constraints. Integrity constraints are assertions on the symbols of the 
alphabet $\mathcal{R}$ that are intended to be satisfied in every database for the schema. The
notion of satisfaction depends on the type of constraints defined over the schema. 

The database constraints of interest are inclusion dependencies (IDs) and key 
dependencies (KDs) (see, e.g., (Abiteboul et al. 1995)). We denote with overlined
uppercase letters (e.g., $\bar{X}$) both sequences and sets of attributes of relations, and
enclose them between vertical bars to denote the number of attributes in the set or
sequence (e.g., $|\bar{X}|$). Given a tuple t in relation $r^D$, i.e., a fact r(t) in a database
D for a schema $\mathcal{R}$, and a sequence of attributes $\bar{X}$ of r, we denote with $t[\bar{X}]$ the
projection (see, e.g., (Abiteboul et al. 1995)) of t on the attributes in $\bar{X}$.

(i) Inclusion dependencies (IDs). An inclusion dependency $\sigma_I$ between relational
predicates $r_1$ and $r_2$ is denoted by $r_1[\bar{X}] \subseteq r_2[\bar{Y}]$. Given a database D with
values only in $\Gamma$, such a constraint is satisfied in D, written $D \models \sigma_I$, iff, for
each tuple $t_1$ in $r_1^D$, there exists a tuple $t_2$ in $r_2^D$ such that $t_1[\bar{X}] = t_2[\bar{Y}]$. An
ID is said to be a full-width ID if every attribute of $r_1$ occurs in $\bar{X}$ exactly
once and every attribute of $r_2$ occurs in $\bar{Y}$ exactly once.

(ii) Key dependencies (KDs). A key dependency $\sigma_K$ over a relational predicate r
with $arity(r) \geq 2$ is denoted by $key(r) = \bar{K}$, where $\bar{K}$ is a nonempty subset of
the attributes of r. Given a database D with values only in $\Gamma$, such a constraint
is satisfied in D, written $D \models \sigma_K$, iff, for each $t_1, t_2 \in r^D$ such that $t_1 \neq t_2$,
we have $t_1[K^*] \neq t_2[K^*]$, where $K^*$ is any sequence of $|\bar{K}|$ attributes where
each attribute in $\bar{K}$ occurs exactly once. Observe that KDs are a special case
of functional dependencies (FDs) (Abiteboul et al. 1995). Note also that we
restricted our definition to predicates with arity at least 2, since for predicates
of smaller arity keys would always be satisfied (under set semantics).

Above, we specified when dependencies are satisfied in databases with values only
in $\Gamma$. For databases with values in $\Gamma \cup \Gamma_f$, we define satisfaction of dependencies
as follows. Given a (key or inclusion) dependency $\sigma$ and a database D with values
in $\Gamma \cup \Gamma_f$, let B be a database obtained from D by replacing every distinct fresh
constant with a distinct non-fresh constant that does not appear elsewhere in D.
We have that $\sigma$ is satisfied in D, written $D \models \sigma$, iff $B \models \sigma$.

A database D over a schema $\mathcal{R}$ is said to satisfy a set of integrity constraints $\Sigma$
expressed over $\mathcal{R}$, written $D \models \Sigma$, if every constraint in $\Sigma$ is satisfied by D.
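
The satisfaction conditions for IDs and KDs can be checked mechanically on a finite database. The following Python sketch is only an illustration and is not part of the formalization: it assumes that a database is encoded as a dictionary from predicate names to sets of tuples, that attributes are identified by 1-based positions, and that fresh constants are ordinary distinct strings (which mirrors their replacement by distinct new non-fresh constants described above). The names `project`, `satisfies_id` and `satisfies_kd`, as well as the sample data, are hypothetical.

```python
from typing import Dict, Sequence, Set, Tuple

# Assumed encoding: predicate name -> set of tuples of constants.
Database = Dict[str, Set[Tuple[str, ...]]]


def project(t: Tuple[str, ...], positions: Sequence[int]) -> Tuple[str, ...]:
    """Projection of tuple t on a sequence of 1-based attribute positions."""
    return tuple(t[i - 1] for i in positions)


def satisfies_id(db: Database, r1: str, xs: Sequence[int],
                 r2: str, ys: Sequence[int]) -> bool:
    """Check the inclusion dependency r1[xs] ⊆ r2[ys] on db."""
    available = {project(t, ys) for t in db.get(r2, set())}
    return all(project(t, xs) in available for t in db.get(r1, set()))


def satisfies_kd(db: Database, r: str, key: Sequence[int]) -> bool:
    """Check key(r) = key: two distinct tuples never agree on the key."""
    seen = set()
    for t in db.get(r, set()):
        k = project(t, key)
        if k in seen:  # another tuple of r already has this key value
            return False
        seen.add(k)
    return True


db = {
    "player": {("pirlo", "acMilan"), ("totti", "roma")},
    "team": {("acMilan", "milan")},
}
id_holds = satisfies_id(db, "player", [2], "team", [1])  # roma has no team tuple
kd_holds = satisfies_kd(db, "player", [1])
```

In this sample database the ID `player[2] ⊆ team[1]` is violated because of a missing tuple, which is the kind of violation that reasoning on incomplete data is meant to address.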

We now briefly introduce the basics of logic programming and Datalog and refer 
to (Lloyd 1987) for further details.

Logic programs. Logic programs are formulated in a language $\mathcal{L}$ of predicates
and functions of nonnegative arity; 0-ary functions are constants. A language $\mathcal{L}$ is
function-free if it contains no functions of arity greater than 0. A term is inductively
defined as follows: each variable X and each constant c is a term, and if f is an
n-ary function symbol and $t_1, \ldots, t_n$ are terms, then $f(t_1, \ldots, t_n)$ is a term. A term
is ground if no variable occurs in it. The Herbrand universe of $\mathcal{L}$, denoted $U_{\mathcal{L}}$, is
the set of all ground terms that can be formed with the functions and constants in
$\mathcal{L}$. An atom is a formula $p(t_1, \ldots, t_n)$, where p is a predicate symbol of arity n and
each $t_i$ is a term; the atom is ground if all $t_i$ are ground. The Herbrand base of a
language $\mathcal{L}$, denoted $B_{\mathcal{L}}$, is the set of all ground atoms that can be formed with
predicates from $\mathcal{L}$ and terms from $U_{\mathcal{L}}$. A definite clause is a rule of the form

$$A_0 \leftarrow A_1, \ldots, A_m$$

where each $A_i$ is an atom. The parts on the left and on the right of $\leftarrow$ are called
the head and the body of the rule, respectively. For a rule $\rho$, we also denote its
head by $head(\rho)$, and its body by $body(\rho)$. A rule whose body is empty (m = 0) and
whose head is ground is called a fact. A logic program is a set of definite clauses.
A clause or logic program is ground if it contains no variables. A clause is range-restricted
if every variable in it also occurs in its body. A program is range-restricted
if all its clauses are.

Each logic program $\Pi$ is associated with the language $\mathcal{L}(\Pi)$ consisting of the
predicates, functions, and constants occurring in $\Pi$. If no constant occurs in $\Pi$, we
add some constant to $\mathcal{L}(\Pi)$ to have a nonempty domain. We simply write $U_\Pi$ and
$B_\Pi$ for $U_{\mathcal{L}(\Pi)}$ and $B_{\mathcal{L}(\Pi)}$, respectively. A Herbrand interpretation of a logic program
$\Pi$ is any subset $I \subseteq B_\Pi$ of its Herbrand base. Intuitively, the atoms in I are true,
and all others are false. A Herbrand model of $\Pi$ is a Herbrand interpretation of
$\Pi$ such that for each rule $A_0 \leftarrow A_1, \ldots, A_m$ in $\Pi$, this interpretation satisfies the
formula $\forall X_1 \ldots \forall X_n (A_1 \wedge \ldots \wedge A_m) \rightarrow A_0$, where $X_1, \ldots, X_n$ are all the variables
in the rule.






Let $\Pi$ be a logic program; the immediate consequence operator $T_\Pi$ on $\Pi$ is a
function from the set of all Herbrand interpretations of $\Pi$ into itself, defined as

$$T_\Pi(I) = \{A_0 \in B_\Pi \mid \text{there is } (A_0 \leftarrow A_1, \ldots, A_m) \text{ in } \Pi \text{ and } \{A_1, \ldots, A_m\} \subseteq I\}$$

The sequence $T_\Pi^0 = \emptyset$, $T_\Pi^{i+1} = T_\Pi(T_\Pi^i)$, $i \geq 0$, always admits a limit, denoted by
$T_\Pi^\omega$, which coincides with the least Herbrand model of $\Pi$, i.e., the unique minimal
model of $\Pi$ (a model being minimal if no proper subset thereof is also a model). For
a set of (ground or non-ground) clauses $\Pi$, the immediate consequence operator is
defined as $T_\Pi = T_{gr(\Pi)}$, where $gr(\Pi)$ is the set of all clauses obtained from any
clause in $\Pi$ by substituting elements of $U_\Pi$ for the variables. A ground atom A is
called a consequence of a set $\Pi$ of clauses if $A \in T_\Pi^\omega$, and we write $\Pi \models A$.
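
For a finite ground program, the limit of the sequence above is reached after finitely many steps and can be computed by plain iteration. The sketch below is a minimal illustration under that assumption (finite, ground, function-free input); ground atoms are encoded as strings and a clause as a pair (head, body), and the names `immediate_consequence` and `least_model` are hypothetical.

```python
from typing import List, Set, Tuple

# Assumed encoding: a ground atom is a string, a ground clause is (head, body).
Clause = Tuple[str, Tuple[str, ...]]


def immediate_consequence(program: List[Clause], interp: Set[str]) -> Set[str]:
    """One application of the operator: heads of clauses whose body holds in interp."""
    return {head for head, body in program if all(a in interp for a in body)}


def least_model(program: List[Clause]) -> Set[str]:
    """Iterate the operator from the empty interpretation until nothing changes."""
    current: Set[str] = set()
    while True:
        following = immediate_consequence(program, current)
        if following == current:
            return current
        current = following


program = [
    ("e(a)", ()),                 # a fact (empty body)
    ("p(a)", ("e(a)",)),          # p(a) <- e(a)
    ("q(a)", ("p(a)", "e(a)")),   # q(a) <- p(a), e(a)
    ("r(a)", ("s(a)",)),          # r(a) <- s(a); s(a) is never derived
]
model = least_model(program)
```

Here the iteration produces {e(a)}, then {e(a), p(a)}, then {e(a), p(a), q(a)}, which is a fixpoint; r(a) is not a consequence of the program.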

An n-ary query $\Pi_q$ over a schema $\mathcal{R}$ consists of an n-ary predicate q (called
query predicate) and a finite set $\Pi$ of definite clauses such that

(1) q is the head predicate for at least one rule in $\Pi$;

(2) the predicate symbols of the head atoms are not relation symbols in $\mathcal{R}$;

(3) the predicate symbols of the body atoms are either relation symbols in $\mathcal{R}$ or
one of the head predicates of a rule in $\Pi$.

The evaluation, called answer, of a query $\Pi_q$ over a database D (which is a set of
facts), written $\Pi_q(D)$, is the restriction to q of the least Herbrand model M of
the logic program $\Pi \cup D$, i.e., the largest subset of M containing only atoms with
predicate q. It will be made clear by the context whether by $\Pi_q(D)$ we refer to the
set of facts or to the set of tuples in the answer.

A Datalog clause is a range-restricted definite clause whose terms are either
variables or constants (no function symbols). A Datalog program is a set of Datalog
clauses. The notion of query given above also applies to Datalog, since Datalog
programs are a specialization of logic programs.

Conjunctive queries. In general, a relational query is a formula that specifies 
a set of data to be retrieved from a database. In the following we will refer to the 
class of conjunctive queries. A conjunctive query (CQ) of arity n over a schema $\mathcal{R}$
is a Datalog query $\Pi_q$ such that $\Pi$ consists of a single rule in which

(1) the head is of the form q(X), where X is a sequence of distinct variables;

(2) the constants occurring in the body are from $\Gamma$;

(3) the predicate symbols of the atoms in the body are in $\mathcal{R}$ (q does not occur in
the body).

The variables occurring in the head of a conjunctive query are called distinguished
variables; the other variables occurring in the body are the non-distinguished variables.
For simplicity, the answer to a conjunctive query q over a database D for $\mathcal{R}$
is more compactly denoted as q(D) (rather than $\Pi_q(D)$).

The answers we are mainly interested in are those that contain no fresh constants,
because fresh constants merely represent existentially quantified variables,
in the same way as Skolem terms and labeled nulls (Fagin et al. 2005). Therefore
we introduce the notation $q^{\Gamma}(D)$ for a CQ q to indicate the largest subset of q(D)
whose tuples contain no fresh constants.
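
Evaluating a conjunctive query over a finite database amounts to finding all assignments of constants to the variables that send every body atom to a fact. The following Python sketch illustrates this by simple backtracking, followed by the restriction to answers without fresh constants. It is a sketch under stated assumptions, not an efficient evaluator: variables are assumed to be strings starting with an uppercase letter, constants are all other strings, the set of fresh constants is passed explicitly, and the names `evaluate_cq` and `fresh_free` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Database = Dict[str, Set[Tuple[str, ...]]]
Atom = Tuple[str, Tuple[str, ...]]  # (predicate, arguments)


def is_variable(term: str) -> bool:
    """Assumed convention: variables start with an uppercase letter."""
    return term[:1].isupper()


def _match(arg: str, const: str, binding: Dict[str, str]) -> bool:
    """Match one argument against a constant, extending binding if needed."""
    if is_variable(arg):
        return binding.setdefault(arg, const) == const
    return arg == const


def evaluate_cq(head: Tuple[str, ...], body: List[Atom],
                db: Database) -> Set[Tuple[str, ...]]:
    """All answers q(D) of the query q(head) <- body (head variables occur in body)."""
    answers: Set[Tuple[str, ...]] = set()

    def extend(i: int, binding: Dict[str, str]) -> None:
        if i == len(body):
            answers.add(tuple(binding[v] for v in head))
            return
        pred, args = body[i]
        for fact in db.get(pred, set()):
            if len(fact) != len(args):
                continue
            candidate = dict(binding)  # copy, so failed matches leave no trace
            if all(_match(a, c, candidate) for a, c in zip(args, fact)):
                extend(i + 1, candidate)

    extend(0, {})
    return answers


def fresh_free(answers: Set[Tuple[str, ...]], fresh: Set[str]) -> Set[Tuple[str, ...]]:
    """Largest subset of the answers whose tuples contain no fresh constants."""
    return {t for t in answers if not any(c in fresh for c in t)}


db = {"team": {("acMilan", "milan"), ("roma", "alpha")}}
fresh = {"alpha"}
all_answers = evaluate_cq(("Y",), [("team", ("X", "Y"))], db)
certain_like = fresh_free(all_answers, fresh)
```

On this sample data the query q(Y) ← team(X, Y) returns both milan and the fresh constant alpha, and only milan survives the restriction.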

Homomorphism. A mapping from one set of symbols, $S_1$, to another set of
symbols, $S_2$, is a function $\mu : S_1 \rightarrow S_2$ defined as follows: (i) $\emptyset$ (empty mapping) is
a mapping; (ii) if $\mu_0$ is a mapping, then $\mu_0 \cup \{X \rightarrow Y\}$, where $X \in S_1$ and $Y \in S_2$,
is a mapping if $\mu_0$ does not already contain some $X \rightarrow Y'$ with $Y \neq Y'$. If $X \rightarrow Y$
is in a mapping $\mu$, we write $\mu(X) = Y$. A homomorphism from a set of atoms $D_1$
to another set of atoms $D_2$, both over the same relational schema $\mathcal{R}$, is a mapping
$\mu$ from $\Gamma \cup \Gamma_f \cup \Gamma_V$ to $\Gamma \cup \Gamma_f \cup \Gamma_V$ such that the following conditions hold: (1) if
$c \in \Gamma$ then $\mu(c) = c$; (2) if $c \in \Gamma_f$ then $\mu(c) \in \Gamma \cup \Gamma_f$; (3) if the atom $r(c_1, \ldots, c_n)$
is in $D_1$ then the atom $r(\mu(c_1), \ldots, \mu(c_n))$ is in $D_2$. In the following, sometimes a
homomorphism may have a codomain different from $\Gamma \cup \Gamma_f \cup \Gamma_V$; for instance, it
could contain terms from the Herbrand universe of a logic program: in such cases,
this will be made explicit.

The notion of homomorphism is naturally extended to atoms as follows. If
$\underline{F} = r(c_1, \ldots, c_n)$ is an atom and $\mu$ a homomorphism, we define
$\mu(\underline{F}) = r(\mu(c_1), \ldots, \mu(c_n))$. For a set of atoms $F = \{\underline{F}_1, \ldots, \underline{F}_m\}$, we define
$\mu(F) = \{\mu(\underline{F}_1), \ldots, \mu(\underline{F}_m)\}$. The set of atoms $\{\mu(\underline{F}_1), \ldots, \mu(\underline{F}_m)\}$ is also called image of F
with respect to $\mu$. In this case, we say that $\mu$ maps F to $\mu(F)$. For a conjunction
of atoms $\Phi = \underline{F}_1, \ldots, \underline{F}_n$, we use $\mu(\Phi)$ to denote the set of atoms $\mu(\{\underline{F}_1, \ldots, \underline{F}_n\})$.
An isomorphism is a bijective homomorphism.
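
The three conditions defining a homomorphism can be verified directly for a given candidate mapping. The Python sketch below does only this verification (it does not search for a homomorphism). It assumes that atoms are encoded as pairs (predicate, arguments), that the sets of non-fresh and fresh constants are given explicitly, and that any symbol not mentioned in the mapping is sent to itself; the names `apply_to_atom` and `is_homomorphism` and the sample atoms are hypothetical.

```python
from typing import Dict, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]  # (predicate, arguments)


def apply_to_atom(mu: Dict[str, str], atom: Atom) -> Atom:
    """Image of an atom; symbols not listed in mu are mapped to themselves."""
    pred, args = atom
    return (pred, tuple(mu.get(a, a) for a in args))


def is_homomorphism(mu: Dict[str, str], d1: Set[Atom], d2: Set[Atom],
                    non_fresh: Set[str], fresh: Set[str]) -> bool:
    """Check conditions (1)-(3) of the definition for the mapping mu."""
    # (1) non-fresh constants are mapped to themselves
    if any(mu.get(c, c) != c for c in non_fresh):
        return False
    # (2) fresh constants are mapped to (fresh or non-fresh) constants
    if any(mu.get(c, c) not in non_fresh | fresh for c in fresh):
        return False
    # (3) the image of every atom of d1 is an atom of d2
    return all(apply_to_atom(mu, a) in d2 for a in d1)


d1 = {("r", ("a", "alpha"))}
d2 = {("r", ("a", "b")), ("r", ("c", "d"))}
non_fresh = {"a", "b", "c", "d"}
fresh = {"alpha"}

ok = is_homomorphism({"alpha": "b"}, d1, d2, non_fresh, fresh)
wrong_image = is_homomorphism({"alpha": "d"}, d1, d2, non_fresh, fresh)
moves_constant = is_homomorphism({"a": "c", "alpha": "d"}, d1, d2, non_fresh, fresh)
```

Only the first mapping is a homomorphism: the second sends r(a, alpha) to r(a, d), which is not in the target set, and the third violates condition (1) by moving the non-fresh constant a.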

Querying incomplete data. In the presence of incomplete data, a natural way 
of considering the problem of query answering is to adopt the so-called sound semantics
or open-world assumption (Reiter 1978; Lenzerini 2002). In this approach,
the data are considered sound but not complete, in the sense that they constitute 
a piece of correct information, but not necessarily all the relevant information. In 
such a case, we need to reason in the presence of incomplete information, thus con- 
sidering a theory (given by the schema and constraints) having multiple models. 
In our context, under relational constraints, it often happens that the data do not 
satisfy the constraints, especially in information integration, where heterogeneous 
data are represented by a single schema. Reasoning with incomplete information 
allows us to address those constraint violations that are caused by the absence of 
elements from the database (such as inclusion dependencies). (Note that violations 
of other kinds of constraints, such as key dependencies, cannot be addressed in this 
way.) More formally, we restrict our attention to the so-called certain answers to a 
query: given a finite database D, the answers we consider are those that are true in 
all models, i.e., in all the databases that contain D and satisfy the dependencies. 
In the following, we shall always assume that the initial database has finite size, 
while no finiteness assumption is made on the models.

Definition 1 (Certain answer)

Consider a relational schema $\mathcal{R}$ with a set of dependencies $\Sigma$, and a finite database
D for $\mathcal{R}$. Let q be a conjunctive query of arity n over $\mathcal{R}$. An n-tuple t is a certain
answer to q w.r.t. D and $\Sigma$ if and only if, for every database B for $\mathcal{R}$ such that
$B \models \Sigma$ and $B \supseteq D$, we have $t \in q(B)$, and t consists of constants in $\Gamma$. The set of
certain answers is denoted by $ans(q, \Sigma, D)$. ■

Example 1 

Consider a relational schema $\mathcal{R}$, here inspired by (Calì et al. 2003b), with the
relations player/2 (player-team pairs) and team/2 (team-city pairs), a set of
IDs $\Sigma = \{player[2] \subseteq team[1]\}$, and a database D consisting of the facts
player(pirlo, acMilan), player(totti, roma), team(acMilan, milan).

The ID in $\Sigma$ tells us that roma is the name of some team in every database
$B \supseteq D$ such that $B \models \Sigma$, i.e., each such database B must contain at least a fact of
the form team(roma, c), where c is some value in $\Gamma$.

Consider now the query $q(X) \leftarrow team(X, Y)$, asking the names of the teams
in the database. By the above considerations, the set of certain answers is
{acMilan, roma}.

Let $\underline{F}$ be the fact $team(roma, \alpha)$, where $\alpha$ is a value in $\Gamma_f$. As we will show
in Section 5, there is a homomorphism from $D \cup \{\underline{F}\}$ to every database $B' \supseteq D$
such that $B' \models \Sigma$. Consider, e.g., such a database B' = {player(pirlo, acMilan),
player(totti, roma), team(acMilan, milan), team(roma, rome), team(psg, paris)}.
There is a homomorphism $\lambda$ from $D \cup \{\underline{F}\}$ to B' such that (i) $\lambda(\alpha) = rome$,
(ii) $\lambda(\underline{F}) = team(roma, rome)$, (iii) $\lambda$ sends all facts in D into themselves, and
(iv) $B' = \lambda(D \cup \{\underline{F}\}) \cup \{team(psg, paris)\}$. ■
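
The construction used in Example 1 can be reproduced in a few lines of Python. The sketch below adds, for each violation of a single ID of the form r1[i] ⊆ r2[j], one tuple padded with fresh constants, and then evaluates q(X) ← team(X, Y) keeping only answers without fresh constants. It is a deliberately limited illustration: it handles one ID with single-attribute projections in a single pass, which suffices here only because no dependency has team on its left-hand side; the function name `add_missing_tuples` and the fresh-constant names `alpha1`, `alpha2`, ... (assumed not to occur in the data) are hypothetical.

```python
from itertools import count
from typing import Dict, Set, Tuple

Database = Dict[str, Set[Tuple[str, ...]]]


def add_missing_tuples(db: Database, r1: str, i: int, r2: str, j: int,
                       arity2: int) -> Tuple[Database, Set[str]]:
    """Repair r1[i] ⊆ r2[j] once by adding tuples padded with fresh constants."""
    repaired = {pred: set(tuples) for pred, tuples in db.items()}
    repaired.setdefault(r2, set())
    fresh: Set[str] = set()
    numbering = count(1)
    present = {t[j - 1] for t in repaired[r2]}
    for t in sorted(db.get(r1, set())):
        value = t[i - 1]
        if value in present:
            continue  # the ID already holds for this tuple
        new_tuple = []
        for position in range(1, arity2 + 1):
            if position == j:
                new_tuple.append(value)
            else:
                name = f"alpha{next(numbering)}"  # assumed unused in db
                fresh.add(name)
                new_tuple.append(name)
        repaired[r2].add(tuple(new_tuple))
        present.add(value)
    return repaired, fresh


D = {
    "player": {("pirlo", "acMilan"), ("totti", "roma")},
    "team": {("acMilan", "milan")},
}
extended, fresh = add_missing_tuples(D, "player", 2, "team", 1, 2)

# q(X) <- team(X, Y), restricted to answers without fresh constants
answers = {t[0] for t in extended["team"] if t[0] not in fresh}
```

The extended database contains team(roma, alpha1), the counterpart of the fact team(roma, α) in the example, and the computed answers coincide with the certain answers {acMilan, roma} stated above.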

We will see that, under the database dependencies we consider in this paper, the 
problem of query answering is mainly complicated by two facts: (i) the number of 
databases that satisfy $\Sigma$ and that include D can be infinite; (ii) there is no bound
to the size of such databases. 

Definition 2 (Querying incomplete databases)

Consider a relational schema $\mathcal{R}$ with a set of dependencies $\Sigma$, and a finite database
D for $\mathcal{R}$. Let q be a conjunctive query of arity n over $\mathcal{R}$. The problem of querying incomplete
databases under $\Sigma$ is the problem of determining all tuples in $ans(q, \Sigma, D)$.
The corresponding decision problem is determining, given also a tuple t of arity n,
whether $t \in ans(q, \Sigma, D)$. ■



3 The Conceptual Model 

In this section we present the conceptual model we shall deal with in the rest of 
the paper, and we give its semantics in terms of relational database schemata with 
constraints. 

Such a model incorporates the basic features of the ER model (Chen 1976) and
OO models, including subset (or is-a) constraints on both entities and relationships.
It is an extension of the one presented in (Calì et al. 2001), and here we
use a notation analogous to that of (Calì et al. 2001). Henceforth, we will call such
a model Extended Entity-Relationship (EER) model, and we will call schemata 
expressed in the EER model Extended Entity-Relationship (EER) schemata. 

An EER schema consists of a collection of entity, relationship, and attribute 
definitions over an alphabet Sym of symbols. The alphabet Sym is partitioned into 
a set of entity symbols (denoted by Ent) , a set of relationship symbols (denoted by 
Rel), and a set of attribute symbols (denoted by Att). 

An entity definition has the form

entity E
    isa: $E_1, \ldots, E_h$
    participates(≥ 1): $R_1 : c_1, \ldots, R_\ell : c_\ell$
    participates(≤ 1): $R'_1 : c'_1, \ldots, R'_{\ell'} : c'_{\ell'}$

where: (i) $E \in Ent$ is the entity to be defined; (ii) the isa clause specifies a set of
entities to which E is related via is-a (i.e., the set of entities that are supersets of E);
(iii) the participates(≥ 1) clause specifies those relationships in which an instance
of E must necessarily participate; for each relationship $R_i$, the clause specifies
that E participates as $c_i$-th component in $R_i$; (iv) the participates(≤ 1) clause
specifies those relationships in which an instance of E cannot participate more than
once (components are specified as in the previous case). The isa, participates(≥ 1)
and participates(≤ 1) clauses are optional. Every relationship mentioned in the
participates(≤ 1) and participates(≥ 1) clauses must then be defined accordingly, by
mentioning the participating entity as one of the entities of the relationship in a
relationship definition. A relationship definition has the form

relationship R among $E_1, \ldots, E_n$
    isa: $R_1[j_{11}, \ldots, j_{1n}], \ldots, R_h[j_{h1}, \ldots, j_{hn}]$

where: (i) $R \in Rel$ is the relationship to be defined; (ii) the n entities of Ent,
with $n \geq 2$, listed in the among clause are those among which the relationship
is defined (i.e., component i of R is an instance of entity $E_i$); (iii) the isa clause
specifies a set of relationships to which R is related via is-a; for each relationship $R_i$, we
specify in square brackets how the components [1, ..., n] are related to those of $R_i$,
by specifying a permutation $[j_{i1}, \ldots, j_{in}]$ of the components of $R_i$; (iv) the number
n of entities in the among clause is the arity of R. The isa clause is optional. An
attribute definition has the form

attribute A of X 

qualification 

where: (i) $A \in Att$ is the attribute to be defined; (ii) X is the entity or relationship
with which the attribute is associated; (iii) qualification consists of none, one, or
both of the keywords functional and mandatory, specifying respectively that each 
instance of X has a unique value for attribute A, and that each instance of X needs 
to have at least a value for attribute A. If the functional or mandatory keywords 
are missing, the attribute is assumed by default to be multivalued and optional, 
respectively. 

For the sake of simplicity, and without any loss of generality, we assume that 
in our EER model attributes of entities or relationships have unique names in a 
schema. We also assume that every attribute or entity takes values from an infinite 
domain. 

The semantics of an EER schema C is defined by (i) associating a relational
schema $\mathcal{R}$ to it, and (ii) specifying when a database for $\mathcal{R}$ satisfies all constraints
imposed by the constructs of the schema C.

Fig. 1. EER schema for Example 2

We now formally define the relational schema associated with an EER diagram. 
Such a relational schema is defined in terms of predicates, which represent the 
so-called concepts (entities, relationships, and attributes) of the EER schema. 

(a) Each entity E in C has an associated predicate e of arity 1. Informally, a fact 
of the form e(c) asserts that c is an instance of entity E. 

(b) Each attribute A for an entity E in C has an associated predicate a of arity 2.
Informally, a fact of the form a(c, d) asserts that d is the value of attribute A
associated with c, where c is an instance of entity E.

(c) Each relationship R involving the entities $E_1, \ldots, E_n$ in C has an associated
predicate r of arity n. Informally, a fact of the form $r(c_1, \ldots, c_n)$ asserts that
$(c_1, \ldots, c_n)$ is an instance of relationship R, where $c_1, \ldots, c_n$ are instances of
$E_1, \ldots, E_n$ respectively.

(d) Each attribute A for a relationship R among the entities $E_1, \ldots, E_n$ in C
has an associated predicate a of arity n + 1. Informally, a fact of the form
$a(c_1, \ldots, c_n, d)$ asserts that d is a value of attribute A associated with the
instance $(c_1, \ldots, c_n)$ of relationship R.

Notice that, in our particular relational representation, entities are represented 
by unary predicates, which can thus be seen as "surrogate keys", i.e., attributes that
are identifiers and do not have any real-world meaning. With this representation, 
user-defined key attributes are not necessary. 

In the following, the expression "query over an EER schema C" will indicate a 
query over the relational schema associated with C according to the above points
(a) to (d). 

Example 2 

Consider the EER schema C defined as follows.

entity Employee
    participates(≥ 1): Works_in : 1
    participates(≤ 1): Works_in : 1
entity Manager
    isa: Employee
    participates(≥ 1): Manages : 1
    participates(≤ 1): Manages : 1
entity Dept
relationship Works_in among Employee, Dept
relationship Manages among Manager, Dept
    isa: Works_in[1,2]
attribute emp_name of Employee
attribute dept_name of Dept
attribute since of Works_in
Figure 1 depicts C in the usual graphical notation for the ER model (components
are indicated by integers for the relationships). The relational schema $\mathcal{R}$ associated
with C consists of the predicates manager/1, employee/1, dept/1, works_in/2,
manages/2, emp_name/2, dept_name/2, since/3. The schema describes employees
working in departments of a firm, and managers that are also employees, and manage
departments. Managers who manage a department also work in the same department,
as imposed by the is-a among the two relationships; the permutation
[1,2] labeling the arrow denotes that the is-a holds considering the components
in the same order (in general, any permutation of (1, ..., n) is possible for an
is-a between two n-ary relationships). The constraint (1,1) on the participation
of Employee in Works_in imposes that every instance of Employee participates at
least once (mandatory participation) and at most once (functional participation)
in Works_in; the same constraints hold on the participation of Manager in Manages.
Suppose we want to know the names of the managers who manage the toy department
(named toy_dept). The corresponding conjunctive query over C is

    q(Z) $\leftarrow$ manager(X), emp_name(X, Z), manages(X, Y), dept(Y),
                  dept_name(Y, toy_dept) ■
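
The association of predicates and arities to the concepts of an EER schema, points (a) to (d) above, is mechanical. The following Python sketch derives the relational schema of Example 2 from a simple encoding of its definitions. The encoding (a list of entities, a dictionary from relationships to their component entities, and a dictionary from attributes to the concept they belong to) and the function name `relational_schema` are assumptions made for this illustration; is-a and participation clauses are ignored because they do not affect the predicates.

```python
from typing import Dict, List


def relational_schema(entities: List[str],
                      relationships: Dict[str, List[str]],
                      attributes: Dict[str, str]) -> Dict[str, int]:
    """Map every EER concept to a lowercase predicate name and its arity."""
    schema: Dict[str, int] = {}
    for entity in entities:
        schema[entity.lower()] = 1                       # point (a)
    for name, components in relationships.items():
        schema[name.lower()] = len(components)           # point (c)
    for name, owner in attributes.items():
        if owner in relationships:
            # point (d): one argument per component plus the value
            schema[name.lower()] = len(relationships[owner]) + 1
        else:
            schema[name.lower()] = 2                     # point (b)
    return schema


entities = ["Employee", "Manager", "Dept"]
relationships = {
    "Works_in": ["Employee", "Dept"],
    "Manages": ["Manager", "Dept"],
}
attributes = {
    "emp_name": "Employee",
    "dept_name": "Dept",
    "since": "Works_in",
}
schema = relational_schema(entities, relationships, attributes)
```

The resulting dictionary lists the same predicates and arities as the example: manager/1, employee/1, dept/1, works_in/2, manages/2, emp_name/2, dept_name/2 and since/3.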

The intended semantics of an EER schema is immediately captured by a translation
into the relational model that imposes additional constraints on the associated
relational schema. Once we have defined the relational schema $\mathcal{R}$ for
an EER schema $\mathcal{C}$, we give the semantics of each construct of the EER
model; this is done by specifying which databases (i.e., extensions of the
predicates of $\mathcal{R}$) satisfy the constraints imposed by the constructs
of the EER diagram. We do that by making use of the relational database
constraints introduced in Section 2. We remind the reader that each entity $E$
in $\mathcal{C}$ has an associated relational predicate $e$ in $\mathcal{R}$,
denoted with the same letter, lowercase instead of uppercase; similarly, an
attribute $A$ has an associated predicate $a$ and a relationship $R$ a
predicate $r$.

(1) For each attribute $A/2$ for an entity $E$ in an attribute definition in
    $\mathcal{C}$, we have the ID $a[1] \subseteq e[1]$.

(2) For each attribute $A/(n+1)$ for a relationship $R/n$ in an attribute
    definition in $\mathcal{C}$, we have the ID
    $a[1,\ldots,n] \subseteq r[1,\ldots,n]$.

(3) For each relationship $R$ involving an entity $E_i$ as $i$-th component
    according to the corresponding relationship definition in $\mathcal{C}$, we
    have the ID $r[i] \subseteq e_i[1]$.

(4) For each mandatory attribute $A/2$ of an entity $E$ in an attribute
    definition in $\mathcal{C}$, we have the ID $e[1] \subseteq a[1]$.

(5) For each mandatory attribute $A/(n+1)$ of a relationship $R/n$ in an
    attribute definition in $\mathcal{C}$, we have the ID
    $r[1,\ldots,n] \subseteq a[1,\ldots,n]$.

(6) For each functional attribute $A/2$ of an entity $E$ in an attribute
    definition in $\mathcal{C}$, we have the KD $\mathit{key}(a) = \{1\}$, since
    there cannot be more than one value for attribute $A$ that is assigned to a
    single instance of $E$.

(7) For each functional attribute $A/(n+1)$ of a relationship $R/n$ in an
    attribute definition of $\mathcal{C}$, we have the KD
    $\mathit{key}(a) = \{1,\ldots,n\}$, since there cannot be more than one
    value for attribute $A$ that is assigned to a single instance of $R$.

(8) For each is-a relation between entities $E_1$ and $E_2$, in an entity
    definition in $\mathcal{C}$, we have the ID $e_1[1] \subseteq e_2[1]$, since
    the is-a relation specifies a set containment between entities $E_1$ and
    $E_2$.

(9) For each is-a relation between relationships $R_1$ and $R_2$, where
    components $1,\ldots,n$ of $R_1$ correspond to components $j_1,\ldots,j_n$
    of $R_2$, in a relationship definition in $\mathcal{C}$, we have the ID
    $r_1[1,\ldots,n] \subseteq r_2[j_1,\ldots,j_n]$, since the is-a relation
    specifies a set containment between relationships $R_1$ and $R_2$.

(10) For each mandatory participation (participation with minimum cardinality 1)
    as $c$-th component of an entity $E$ in a relationship $R$, specified by a
    clause participates(≥ 1): R : c in an entity definition in $\mathcal{C}$, we
    have the ID $e[1] \subseteq r[c]$.

(11) For each participation with maximum cardinality 1 as $c$-th component of an
    entity $E$ in a relationship $R$, specified by a clause
    participates(≤ 1): R : c in an entity definition in $\mathcal{C}$, we have
    the KD $\mathit{key}(r) = \{c\}$.
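
The translation rules above are mechanical, so they can be sketched in a few lines of Python. The following is a minimal illustration, not the authors' implementation: the dictionary-based schema encoding and the function name `conceptual_dependencies` are assumptions made here, and rules (4)-(7) on mandatory and functional attributes are left out for brevity. An ID is encoded as a tuple `(p, positions_p, s, positions_s)` standing for $p[\ldots] \subseteq s[\ldots]$, and a KD as `(p, key_positions)`.

```python
def conceptual_dependencies(entities, relationships, attributes):
    """Derive (ids, kds) from a schema given as plain dictionaries (hypothetical encoding)."""
    ids, kds = [], []
    for a, owner in attributes.items():
        # An attribute of an n-ary relationship has arity n + 1; of an entity, arity 2.
        n = len(relationships[owner]["among"]) if owner in relationships else 1
        pos = tuple(range(1, n + 1))
        ids.append((a, pos, owner, pos))                      # rules (1) and (2)
    for r, spec in relationships.items():
        n = len(spec["among"])
        for i, e in enumerate(spec["among"], start=1):
            ids.append((r, (i,), e, (1,)))                    # rule (3)
        for parent, perm in spec.get("isa", []):
            ids.append((r, tuple(range(1, n + 1)), parent, tuple(perm)))  # rule (9)
    for e, spec in entities.items():
        for parent in spec.get("isa", []):
            ids.append((e, (1,), parent, (1,)))               # rule (8)
        for r, c in spec.get("mandatory", []):
            ids.append((e, (1,), r, (c,)))                    # rule (10)
        for r, c in spec.get("functional", []):
            kds.append((r, (c,)))                             # rule (11)
    return ids, kds


# The Employee/Manager/Dept schema of the running example.
entities = {
    "employee": {"mandatory": [("works_in", 1)], "functional": [("works_in", 1)]},
    "manager": {"isa": ["employee"],
                "mandatory": [("manages", 1)], "functional": [("manages", 1)]},
    "dept": {},
}
relationships = {
    "works_in": {"among": ["employee", "dept"]},
    "manages": {"among": ["manager", "dept"], "isa": [("works_in", (1, 2))]},
}
attributes = {"emp_name": "employee", "dept_name": "dept", "since": "works_in"}

ids, kds = conceptual_dependencies(entities, relationships, attributes)
for dep in ids + kds:
    print(dep)
```

On this schema the sketch yields eleven IDs and two KDs, one per construct of the schema definition.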

Definition 3 (Conceptual dependencies)

Consider a schema $\mathcal{R}$ and a set of dependencies
$\Sigma = \Sigma_I \cup \Sigma_K$, where $\Sigma_I$ is a set of inclusion
dependencies and $\Sigma_K$ is a set of key dependencies expressed over
$\mathcal{R}$. We say that $\Sigma$ is a set of conceptual dependencies (CDs) if
there exists an EER schema $\mathcal{C}$ with associated relational schema
$\mathcal{R}$ such that $\Sigma$ is obtained from $\mathcal{C}$ by applying the
above points (1)-(11). ■

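Example 2 (cont.)

Consider again the EER schema shown in Figure 1. The set of conceptual
dependencies associated with the EER schema $\mathcal{C}$, to be imposed on the
schema $\mathcal{R}$, consists of the following dependencies.

\[
\begin{array}{lrcll}
\sigma_1: & \mathit{dept\_name}[1] & \subseteq & \mathit{dept}[1] & \text{(by rule 1)}\\
\sigma_2: & \mathit{emp\_name}[1] & \subseteq & \mathit{employee}[1] & \text{(by rule 1)}\\
\sigma_3: & \mathit{since}[1,2] & \subseteq & \mathit{works\_in}[1,2] & \text{(by rule 2)}\\
\sigma_4: & \mathit{works\_in}[1] & \subseteq & \mathit{employee}[1] & \text{(by rule 3)}\\
\sigma_5: & \mathit{works\_in}[2] & \subseteq & \mathit{dept}[1] & \text{(by rule 3)}\\
\sigma_6: & \mathit{manages}[1] & \subseteq & \mathit{manager}[1] & \text{(by rule 3)}\\
\sigma_7: & \mathit{manages}[2] & \subseteq & \mathit{dept}[1] & \text{(by rule 3)}\\
\sigma_8: & \mathit{manager}[1] & \subseteq & \mathit{employee}[1] & \text{(by rule 8)}\\
\sigma_9: & \mathit{manages}[1,2] & \subseteq & \mathit{works\_in}[1,2] & \text{(by rule 9)}\\
\sigma_{10}: & \mathit{employee}[1] & \subseteq & \mathit{works\_in}[1] & \text{(by rule 10)}\\
\sigma_{11}: & \mathit{manager}[1] & \subseteq & \mathit{manages}[1] & \text{(by rule 10)}\\
\sigma_{12}: & \mathit{key}(\mathit{works\_in}) & = & \{1\} & \text{(by rule 11)}\\
\sigma_{13}: & \mathit{key}(\mathit{manages}) & = & \{1\} & \text{(by rule 11)}
\end{array}
\]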

Now we characterize the form of relational dependencies resulting from the
encoding of EER schemata into relational schemata, the proof of which is
straightforward.

Proposition 1

Consider a schema $\mathcal{R}$ and a set of dependencies
$\Sigma = \Sigma_I \cup \Sigma_K$, where $\Sigma_I$ is a set of inclusion
dependencies and $\Sigma_K$ is a set of key dependencies expressed over
$\mathcal{R}$. Then, $\Sigma$ is a set of CDs if and only if we can partition
$\mathcal{R}$ in three sets $\mathcal{R}_R$, $\mathcal{R}_E$, and
$\mathcal{R}_A$ such that the following holds.

(a) All predicate symbols in $\mathcal{R}_E$ are unary.

(b) All predicate symbols in $\mathcal{R}_R$ and $\mathcal{R}_A$ have arity at
    least 2.

(c) The dependencies in $\Sigma_K$ have one of the following forms

    (1) $\mathit{key}(r) = \{i\}$, with $1 \le i \le \mathit{arity}(r)$, where
        $r \in \mathcal{R}_R$.

    (2) $\mathit{key}(a) = \{1,\ldots,n\}$, where $a \in \mathcal{R}_A$ and
        $n = \mathit{arity}(a) - 1$.

(d) The dependencies in $\Sigma_I$ have one of the following forms

    (1) $e_1[1] \subseteq e_2[1]$, where $\{e_1,e_2\} \subseteq \mathcal{R}_E$.

    (2) $e[1] \subseteq r[i]$, where $e \in \mathcal{R}_E$,
        $r \in \mathcal{R}_R$, and $1 \le i \le \mathit{arity}(r)$.

    (3) $r[i] \subseteq e[1]$, where $r \in \mathcal{R}_R$,
        $e \in \mathcal{R}_E$, and $1 \le i \le \mathit{arity}(r)$.

    (4) $r_1[1,\ldots,k] \subseteq r_2[j_1,\ldots,j_k]$, where
        $\{r_1,r_2\} \subseteq \mathcal{R}_R$,
        $\mathit{arity}(r_1) = \mathit{arity}(r_2) = k$, and $(j_1,\ldots,j_k)$
        is a permutation of $(1,\ldots,k)$.

    (5) $a[1] \subseteq e[1]$, where $a \in \mathcal{R}_A$ and
        $e \in \mathcal{R}_E$.

    (6) $a[1,\ldots,n] \subseteq r[1,\ldots,n]$, where $a \in \mathcal{R}_A$,
        $r \in \mathcal{R}_R$, and
        $n = \mathit{arity}(r) = \mathit{arity}(a) - 1$.

    (7) $e[1] \subseteq a[1]$, where $e \in \mathcal{R}_E$ and
        $a \in \mathcal{R}_A$.

    (8) $r[1,\ldots,n] \subseteq a[1,\ldots,n]$, where $r \in \mathcal{R}_R$,
        $a \in \mathcal{R}_A$, and
        $n = \mathit{arity}(r) = \mathit{arity}(a) - 1$.

(e) For every predicate $r \in \mathcal{R}_R$ and for
    $1 \le i \le \mathit{arity}(r)$, there exists an ID
    $r[i] \subseteq e_i[1]$ in $\Sigma_I$ such that $e_i \in \mathcal{R}_E$ and
    there is no $e_i' \in \mathcal{R}_E$, with $e_i \ne e_i'$, such that
    $r[i] \subseteq e_i'[1]$ is in $\Sigma_I$.

(f) For every predicate $a \in \mathcal{R}_A$, there exists an ID
    $a[1,\ldots,n] \subseteq p[1,\ldots,n]$ in $\Sigma_I$ such that
    $p \in \mathcal{R}_R \cup \mathcal{R}_E$ and
    $n = \mathit{arity}(p) = \mathit{arity}(a) - 1$, and there is no
    $p' \in \mathcal{R}_R \cup \mathcal{R}_E$, with $p \ne p'$, such that
    $a[1,\ldots,n] \subseteq p'[1,\ldots,n]$ is in $\Sigma_I$.

(g) For every ID $e[1] \subseteq r[i]$ in $\Sigma_I$, with
    $e \in \mathcal{R}_E$, $r \in \mathcal{R}_R$, and
    $1 \le i \le \mathit{arity}(r)$, there is an ID $r[i] \subseteq e[1]$ in
    $\Sigma_I$.

(h) For every ID $r[1,\ldots,n] \subseteq a[1,\ldots,n]$ in $\Sigma_I$, with
    $r \in \mathcal{R}_R$, $a \in \mathcal{R}_A$, and
    $n = \mathit{arity}(r) = \mathit{arity}(a) - 1$, there is an ID
    $a[1,\ldots,n] \subseteq r[1,\ldots,n]$ in $\Sigma_I$.

(i) For every ID $e[1] \subseteq a[1]$ in $\Sigma_I$, with
    $e \in \mathcal{R}_E$, $a \in \mathcal{R}_A$, and $\mathit{arity}(a) = 2$,
    there is an ID $a[1] \subseteq e[1]$ in $\Sigma_I$.

Being able to encode EER schemata into relational ones, henceforth we will deal 
with relational schemata only. 

The problem of querying incomplete databases under KDs and IDs is in general
undecidable (Calì 2003; Calì et al. 2003a). The largest subclass of functional
dependencies (a generalization of key dependencies; see Abiteboul et al. 1995)
and inclusion dependencies for which query answering is known to be decidable
is the class of keys and non-key-conflicting inclusion dependencies (Calì 2003;
Calì et al. 2003a). The main contribution of the present paper is a technique
for solving the problem of querying incomplete databases under CDs. This is
relevant because EER schemata are very important in practice and CDs are able
to capture them. Our solution consists in a technique for rewriting the given
query such that the evaluation of the rewritten query returns the certain
answers.

Note that our definition of certain answer, given in Section 2, considers
databases that may also be of infinite size. In the database literature,
interest is typically devoted to databases of finite size only. In particular,
the certain answers under finite models can be defined as follows.

Definition 4 (Certain answer under finite models)

Consider a relational schema $\mathcal{R}$ with a set of dependencies $\Sigma$,
and a finite database $D$ for $\mathcal{R}$. Let $q$ be a conjunctive query of
arity $n$ over $\mathcal{R}$. An $n$-tuple $t$ is a certain answer under finite
models to $q$ w.r.t. $D$ and $\Sigma$ if and only if, for every finite database
$B$ for $\mathcal{R}$ such that $B \models \Sigma$ and $B \supseteq D$, we have
$t \in q(B)$, and $t$ consists of constants in $\Gamma$. The set of certain
answers under finite models is denoted by $\mathit{ans}_f(q,\Sigma,D)$. ■

We now show that under CDs, in general,
$\mathit{ans}(q,\Sigma,D) \ne \mathit{ans}_f(q,\Sigma,D)$.

Example 3

Consider the following EER schema:

entity B
  participates(≥ 1): R : 2
entity A
  isa: B
  participates(≤ 1): R : 1
relationship R among A, B

This corresponds to the following set of CDs:

\[
\Sigma = \{\; r[1] \subseteq a[1],\quad r[2] \subseteq b[1],\quad a[1] \subseteq b[1],\quad b[1] \subseteq r[2],\quad \mathit{key}(r) = \{1\} \;\}
\]

Consider the database $D = \{b(c)\}$. It can be straightforwardly seen that, for
every finite database $B \supseteq D$ such that $B \models \Sigma$, we have
$a(c) \in B$. Consequently, $(c) \in \mathit{ans}_f(q,\Sigma,D)$, where $q$ is
the query $q(x) \leftarrow a(x)$. On the other hand, consider the following
database $D_\infty$.

\[
D_\infty = \{\, b(c),\ r(c_1,c),\ a(c_1),\ b(c_1),\ r(c_2,c_1),\ a(c_2),\ b(c_2),\ r(c_3,c_2),\ \ldots,\ a(c_i),\ b(c_i),\ r(c_{i+1},c_i),\ \ldots \,\}
\]

We have that $D_\infty \supseteq D$ and $D_\infty \models \Sigma$, but
$a(c) \notin D_\infty$ and thus $(c) \notin \mathit{ans}(q,\Sigma,D)$; therefore
we immediately have
$\mathit{ans}(q,\Sigma,D) \ne \mathit{ans}_f(q,\Sigma,D)$. ■
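
The gap between finite and unrestricted models in Example 3 can be checked on small instances. The sketch below is only an illustration under assumptions made here: facts are encoded as `(predicate, tuple)` pairs, and the helper names `violations` and `prefix` are hypothetical. It lists which CDs of the example a finite set of facts violates, and shows that a finite prefix of the infinite chain always leaves one inclusion unsatisfied, while closing the chain through c forces a(c).

```python
def violations(db):
    """CDs of Example 3 violated by a finite set of facts, as (dependency, witness) pairs."""
    a = {t[0] for p, t in db if p == "a"}
    b = {t[0] for p, t in db if p == "b"}
    r = {t for p, t in db if p == "r"}
    targets = {y for _, y in r}
    firsts = [x for x, _ in r]
    out = [("r[1] <= a[1]", x) for x, _ in r if x not in a]
    out += [("r[2] <= b[1]", y) for _, y in r if y not in b]
    out += [("a[1] <= b[1]", x) for x in a if x not in b]
    out += [("b[1] <= r[2]", y) for y in b if y not in targets]
    out += [("key(r) = {1}", x) for x in set(firsts) if firsts.count(x) > 1]
    return sorted(out)


def prefix(n):
    """b(c) followed by the first n rounds r(c_i, c_{i-1}), a(c_i), b(c_i) of the chain."""
    db, prev = {("b", ("c",))}, "c"
    for i in range(1, n + 1):
        ci = f"c{i}"
        db |= {("r", (ci, prev)), ("a", (ci,)), ("b", (ci,))}
        prev = ci
    return db


# A finite prefix always misses an r-fact pointing at its last constant.
print(violations(prefix(3)))

# Closing the chain with r(c, c3) satisfies that inclusion but now requires a(c).
closed = prefix(3) | {("r", ("c", "c3"))}
print(violations(closed))
print(violations(closed | {("a", ("c",))}))
```

Reusing an existing first component such as c1 to close the chain would instead violate the key on r, which is why a finite model has to route the chain back through c.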

Henceforth, we shall not restrict our attention to finite databases only, thus allowing 
for models of infinite size. 

4 Query Answering with the Chase

In this section we introduce the notion of chase, which is a fundamental tool
for dealing with database constraints (Maier et al. 1979; Maier et al. 1981;
Vardi 1983; Johnson and Klug 1984); then we show some relevant properties of
the chase under conceptual dependencies (CDs) regarding conjunctive query
answering, which will pave the way for the query rewriting technique presented
in the next section.

The chase (Maier et al. 1979; Johnson and Klug 1984) is a key concept, in
particular in the context of functional and inclusion dependencies.
Intuitively, given a database, its facts in general do not satisfy the
dependencies; the idea of the chase is to convert the initial facts into a new
set of facts constituting a database that satisfies the dependencies, possibly
by collapsing facts (according to KDs) or adding new facts (according to IDs).
When new facts are added, some of the constants need to be fresh, as we shall
see in the following. The technique to construct a chase is well known for
functional and inclusion dependencies (see, e.g., Johnson and Klug 1984);
however, we detail this technique here, since we have adapted it to the simpler
case of KDs instead of functional dependencies.



4.1 Construction of the chase

In order to construct the chase for a database for a relational schema
$\mathcal{R}$ with dependencies $\Sigma = \Sigma_I \cup \Sigma_K$, where
$\Sigma_I$ is a set of inclusion dependencies and $\Sigma_K$ is a set of key
dependencies, we use the following rules for IDs and KDs, which apply to a set
of facts (i.e., a database instance) and produce a new set of facts. We
indicate as $D$ the set of facts before the application of a rule.

Inclusion Dependency Chase Rule. Let $r$, $s$ be relational symbols in
$\mathcal{R}$. Suppose there is a tuple $t$ in $r^D$, and there is an ID
$\sigma \in \Sigma_I$ of the form $r[X_r] \subseteq s[X_s]$. If there is no
tuple $t'$ in $s^D$ such that $t'[X_s] = t[X_r]$ (in this case we say the rule
is applicable), then we add a new tuple $t_{chase}$ in $s^D$ such that
$t_{chase}[X_s] = t[X_r]$, and for every attribute $A_i$ of $s$ such that
$A_i \notin X_s$, $t_{chase}[A_i]$ is a fresh value in $\Gamma_f$ that follows,
according to lexicographic order, all the values already present in the chase.
Note also that we assume that all the values in $\Gamma_f$ follow, according to
lexicographic order, all the values in $\Gamma$.

Key Dependency Chase Rule. Let $r$ be a relational symbol in $\mathcal{R}$.
Suppose there is a KD $\kappa$ of the form $\mathit{key}(r) = X$. If there are
two distinct tuples $t, t' \in r^D$ such that $t[X] = t'[X]$ (in this case we
say the rule is applicable), make the symbols in $t$ and $t'$ equal in the
following way. Let $Y = Y_1,\ldots,Y_\ell$ be the attributes of $r$ that are
not in $X$; for all $i \in \{1,\ldots,\ell\}$, make $t[Y_i]$ and $t'[Y_i]$
merge into a combined symbol according to the following criterion: (i) if both
are constants in $\Gamma$ and they are not equal, the rule fails to apply and
the chase construction process is halted; (ii) if one is in $\Gamma$ and the
other is a fresh constant in $\Gamma_f$, let the combined symbol be the
non-fresh constant; (iii) if both are in $\Gamma_f$, let the combined symbol be
the one preceding the other in lexicographic order. Finally, replace all
occurrences in $D$ of $t[Y_i]$ and $t'[Y_i]$ with their combined symbol.

Now we come to the formal definition of the chase, which uses the notion of
level of a tuple; intuitively, the lower the level of a tuple, the earlier the
tuple has been constructed in the chase. In order to make all steps in the
construction of the chase univocally determined by the definition, we assume
that all facts can be sorted according to lexicographic order (e.g., by using a
string comprising the predicate name and the names of all constants in the
fact), and so can all pairs of facts as well as all dependencies (e.g., also by
using strings that encode them).

Definition 5 (Chase)

Let $D$ be a database for a schema $\mathcal{R}$, and $\Sigma$ a set of CDs. We
call chase of $D$ according to $\Sigma$, denoted $\mathit{chase}_\Sigma(D)$,
the database constructed from $D$ by repeatedly executing the following steps,
while the KD and ID chase rules are applicable; every tuple
$t \in \mathit{chase}_\Sigma(D)$ is also assigned a level, denoted by
$\mathit{level}(t)$; if $t \in D$, then $\mathit{level}(t) = 0$.

(1) While there are pairs of facts on which the KD chase rule is applicable,
take the pair $t_1, t_2$ such that
$\min(\mathit{level}(t_1), \mathit{level}(t_2))$ is minimal (if there is more
than one, take the pair that comes first in lexicographic order) and apply the
KD chase rule on $t_1, t_2$ w.r.t. a KD $\kappa$ (if there is more than one KD
for which the KD chase rule is applicable on $t_1, t_2$, take the KD that comes
first in lexicographic order) so that $t_1, t_2$ collapse into a fact $t_3$; if
the rule fails, the chase cannot be constructed and, thus, does not exist; else
we define $\mathit{level}(t_3) = \min(\mathit{level}(t_1), \mathit{level}(t_2))$.

(2) If there are facts on which the ID chase rule is applicable w.r.t. a
full-width ID, choose the one (say $t'$) at the lowest level that
lexicographically comes first and apply the ID chase rule on $t'$ w.r.t. a
full-width ID $\sigma$ (if there is more than one full-width ID for which the
ID chase rule is applicable on $t'$, take the full-width ID that comes first in
lexicographic order) to generate a new fact $t''$; else, if there are facts on
which the ID chase rule is applicable, choose the one (say $t'$) at the lowest
level that lexicographically comes first and apply the ID chase rule on $t'$
w.r.t. an ID $\sigma$ (if there is more than one ID for which the ID chase rule
is applicable on $t'$, take the ID that comes first in lexicographic order) to
generate a new fact $t''$. We define
$\mathit{level}(t'') = \mathit{level}(t') + 1$. ■

Note that, according to Definition 5, the chase is constructed by applying the
KD chase rule as long as possible, then the ID chase rule exactly once, then
the KD chase rule as long as possible, etc., until no more rule is applicable.
Also, the particular sequence of chase rules to be applied is determined
according to a precise lexicographic order, so that there is exactly one chase
for a given initial database and set of CDs.
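
The alternation described above (KD rule to exhaustion, then one ID step, and so on) can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes that fresh constants are Python integers and non-fresh constants are strings, that a full-width ID is one mentioning every position of both predicates, and it adds a `max_level` cut-off (not part of Definition 5) so that the loop terminates even when the chase is infinite. All function and variable names are hypothetical.

```python
from itertools import count


class ChaseFailure(Exception):
    """The KD chase rule would have to equate two distinct non-fresh constants."""


def is_fresh(c):
    return isinstance(c, int)   # assumption: fresh constants are ints, the others strings


def sort_key(fact):
    pred, args = fact           # fresh constants sort after all non-fresh ones
    return (pred, [(1, c, "") if is_fresh(c) else (0, 0, c) for c in args])


def substitute(level, drop, keep):
    """Replace every occurrence of `drop` by `keep`; collapsed facts keep the lower level."""
    old = dict(level)
    level.clear()
    for (pred, args), lv in old.items():
        fact = (pred, tuple(keep if c == drop else c for c in args))
        level[fact] = min(lv, level.get(fact, lv))


def kd_step(level, kds):
    """Merge one pair of symbols by the KD chase rule; return True if it was applicable."""
    ordered = sorted(level, key=lambda f: (level[f], sort_key(f)))
    for f1 in ordered:
        for f2 in ordered:
            for pred, key in sorted(kds):
                if f1 == f2 or f1[0] != pred or f2[0] != pred:
                    continue
                if any(f1[1][k - 1] != f2[1][k - 1] for k in key):
                    continue
                for u, v in zip(f1[1], f2[1]):
                    if u == v:
                        continue
                    if not is_fresh(u) and not is_fresh(v):
                        raise ChaseFailure(f"{f1} and {f2} violate key({pred})")
                    keep_u = not is_fresh(u) or (is_fresh(v) and u < v)
                    substitute(level, v if keep_u else u, u if keep_u else v)
                    return True
    return False


def id_step(level, ids, arity, fresh, max_level):
    """Apply the ID chase rule once (full-width IDs first); return True if a fact was added."""
    for full_width in (True, False):
        for fact in sorted(level, key=lambda f: (level[f], sort_key(f))):
            if level[fact] >= max_level:
                continue        # truncation, not part of the definition
            for p, xp, s, xs in sorted(ids):
                is_full = len(xp) == arity[p] and len(xs) == arity[s]
                if fact[0] != p or is_full != full_width:
                    continue
                proj = tuple(fact[1][i - 1] for i in xp)
                if any(g[0] == s and tuple(g[1][j - 1] for j in xs) == proj for g in level):
                    continue    # the ID is already satisfied for this tuple
                new = [None] * arity[s]
                for j, c in zip(xs, proj):
                    new[j - 1] = c
                level[(s, tuple(next(fresh) if c is None else c for c in new))] = level[fact] + 1
                return True
    return False


def chase(db, ids, kds, arity, max_level=5):
    """Return a dict fact -> level for the chase of db, truncated at max_level."""
    level, fresh = {fact: 0 for fact in db}, count(1)
    while True:
        while kd_step(level, kds):
            pass
        if not id_step(level, ids, arity, fresh, max_level):
            return level


# Employee/Manager/Dept schema without attributes; initial facts manager(m), works_in(m, d).
ARITY = {"employee": 1, "manager": 1, "dept": 1, "works_in": 2, "manages": 2}
IDS = [("works_in", (1,), "employee", (1,)), ("works_in", (2,), "dept", (1,)),
       ("manages", (1,), "manager", (1,)), ("manages", (2,), "dept", (1,)),
       ("manager", (1,), "employee", (1,)), ("manages", (1, 2), "works_in", (1, 2)),
       ("employee", (1,), "works_in", (1,)), ("manager", (1,), "manages", (1,))]
KDS = [("works_in", (1,)), ("manages", (1,))]
result = chase({("manager", ("m",)), ("works_in", ("m", "d"))}, IDS, KDS, ARITY)
print(sorted(result, key=sort_key))
```

On these two initial facts the fresh department introduced for manages is merged with d by the key on works_in, so the result contains no fresh constant. With an ID such as r[1,2] ⊆ s[1,2], a key on the first position of s, and facts r(a, b), s(a, c), the same function raises `ChaseFailure`.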

As we pointed out before, the aim of the construction of the chase is to make
the initial database satisfy the KDs and the IDs, by repairing the violations
of the constraints. The obtained (possibly infinite) instance is a
representative of all databases that are a superset of the initial database and
satisfy the constraints. Notice that key dependency violations cannot be
repaired by constructing a chase, but would require an explicit treatment, as
explained in Section 5.4; in such a case the chase does not exist. It is easy
to see that $\mathit{chase}_\Sigma(D)$ can be infinite only if the set of IDs
in $\Sigma$ is cyclic (Abiteboul et al. 1995; Johnson and Klug 1984), i.e., if
there is a sequence of IDs in $\Sigma$ of the form
$r_1[X_1] \subseteq r_2[X_1'],\ r_2[X_2] \subseteq r_3[X_2'],\ \ldots,\ r_n[X_n] \subseteq r_{n+1}[X_n']$
and $r_{n+1} = r_1$. In the following we will show how the chase can be used in
computing the answers to queries over incomplete databases under dependencies.
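
Cyclicity of a set of IDs, in the sense just described, amounts to a directed cycle in the graph whose nodes are predicates and whose edges follow the IDs. A minimal sketch, assuming IDs are encoded as tuples `(p, positions_p, s, positions_s)` (an encoding chosen here for illustration) and using a standard depth-first search:

```python
def ids_are_cyclic(ids):
    """True if the predicate graph induced by the IDs contains a directed cycle."""
    graph = {}
    for p, _, s, _ in ids:
        graph.setdefault(p, set()).add(s)
        graph.setdefault(s, set())
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the current path, finished
    color = dict.fromkeys(graph, WHITE)

    def visit(u):
        color[u] = GREY
        for v in graph[u]:
            if color[v] == GREY or (color[v] == WHITE and visit(v)):
                return True               # reached a predicate already on the path
        color[u] = BLACK
        return False

    return any(color[u] == WHITE and visit(u) for u in graph)


# employee[1] ⊆ works_in[1] and works_in[1] ⊆ employee[1] already form a cycle.
print(ids_are_cyclic([("employee", (1,), "works_in", (1,)),
                      ("works_in", (1,), "employee", (1,))]))
print(ids_are_cyclic([("manager", (1,), "employee", (1,))]))
```

Since mandatory participations always come paired with the typing ID of the relationship, sets of CDs with a mandatory participation are cyclic in this sense, which is why the chase under CDs is in general infinite.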



4.2 Query Answering and the Chase

In their milestone paper (Johnson and Klug 1984), Johnson and Klug proved that,
under certain subclasses of KDs and IDs, a containment between two conjunctive
queries $q_1$ and $q_2$ can be tested by verifying the existence of a so-called
query homomorphism. Roughly speaking, such a homomorphism has to map the body
of $q_2$ to the chase of the body of $q_1$, and the head of $q_2$ to the head
of $q_1$. Johnson and Klug proved that, in order to test containment of CQs
under IDs alone or key-based dependencies (a special class of KDs and IDs), it
is sufficient to consider a finite, initial portion of the chase. The result of
(Johnson and Klug 1984) was extended in (Calì et al. 2003a) to a broader class
of dependencies, strictly more general than keys with foreign keys: the class
of KDs and non-key-conflicting inclusion dependencies (NKCIDs) (Calì 2003),
which behave like IDs alone because NKCIDs do not interfere with KDs in the
construction of the chase. The above results about query containment (see,
e.g., (Calì et al. 2008)) can be straightforwardly adapted to solve the
decision problem of query answering on incomplete databases, since, as will be
shown later, the chase is a representative of all databases that satisfy the
dependencies and are a superset of the initial data.

In a set of CDs, IDs are not non-key-conflicting (or, better, they are
key-conflicting); therefore the decidability of query answering cannot be
deduced from (Johnson and Klug 1984; Calì et al. 2003a) (though it can be
derived from (Calvanese et al. 1998), as we shall discuss later). In
particular, under CDs, the construction of the chase has to face interactions
between KDs and IDs; this can be seen in the following example, taken from
(Calì 2006).

Example 4

Consider again the EER schema of Example 2. Suppose we have an initial
(incomplete) database, with the facts manager(m) and works_in(m, d). If we
construct the chase, we obtain the facts employee(m), manages(m, $\alpha_1$),
works_in(m, $\alpha_1$), dept($\alpha_1$), where $\alpha_1$ is a fresh
constant. Observe that m cannot participate more than once in works_in, so we
deduce $\alpha_1 = d$. We must therefore replace $\alpha_1$ with d in the rest
of the chase, including the part that has been constructed so far. Therefore,
$\mathit{chase}_\Sigma(D)$ = {manager(m), works_in(m, d), employee(m),
manages(m, d), dept(d)}.



In spite of the potentially harmful interaction between IDs and KDs,
analogously to the case of IDs alone (Calì et al. 2004), it can be proved that,
in the presence of CDs, the chase is a representative of all databases that are
a superset of the initial (incomplete) data, and satisfy the dependencies;
therefore, it serves as a tool for query answering, as shown in Theorem 1
below.

As was made explicit in Definition 5, the chase may not exist if some
application of the KD rule fails. This may happen even when the database
satisfies the key dependencies, as shown in the next example.

Example 5

Consider two binary predicates $r$ and $s$, derived from two binary
relationships $R$ and $S$, for which there is an is-a relation (that generates
the ID $r[1,2] \subseteq s[1,2]$) and a participation with maximum cardinality
1 for the first component of $S$ (that generates the KD
$\mathit{key}(s) = \{1\}$). The mentioned ID and KD are a fragment of a set of
CDs that is sufficient to show that the chase may not exist even if the initial
database satisfies the dependencies. Let the initial database be
$D = \{r(a,b), s(a,c)\}$. Although $D$ satisfies the KD, the chase rule for the
ID generates a tuple $s(a,b)$, which triggers a (failing) KD chase rule
application on $s(a,b)$ and $s(a,c)$. Therefore the chase for this database and
constraints does not exist. ■

Since the chase may be of infinite size, it would seem that checking whether a chase 
exists is semi-decidable. Indeed, in the general case of IDs and KDs it is not known 
whether it is decidable to check whether the chase exists. 

However, the following lemma shows that termination of the chase under CDs is 
decidable; we will then use it to state some of our results. 

Lemma 1 

Let $D$ be a database for a relational schema $\mathcal{R}$ and $\Sigma$ a set
of CDs over $\mathcal{R}$. Then, checking whether $\mathit{chase}_\Sigma(D)$
exists is decidable in time polynomial in the size of $D$.

Proof 

We start by observing that the application of a unary ID (i.e., an ID that
involves a single attribute) cannot cause a failure of the chase by violation
of a KD: indeed, considering a generic unary ID $r_1[k_1] \subseteq r_2[k_2]$,
the only possible violation of a KD due to the application of this ID is when
we have the KD $\mathit{key}(r_2) = \{k_2\}$; however, such a violation never
causes a failure of the chase, since all values in the added tuple that are in
positions different from $k_2$ are fresh constants. Now, let us indicate with
$\Sigma_R$ the IDs in $\Sigma$ that derive from is-a relations among
relationships; they are IDs of the form
$r_1[1,\ldots,n] \subseteq r_2[j_1,\ldots,j_n]$, where $j_1,\ldots,j_n$ is a
permutation of $1,\ldots,n$ and both $r_1$ and $r_2$ have arity $n$. It is
immediately seen that:

(i) Facts in the chase of the form $r(c_1,\ldots,c_n)$, where $r$ is a relation
    belonging to the set $\mathcal{R}_R$ of $n$-ary relationships in the
    conceptual schema, contain

    • only non-fresh constants,

    • only fresh constants, or

    • exactly one non-fresh constant (possibly occurring more than once).

    No other case is possible. This can be shown by induction on the number of
    applications of chase rules. Consider also that

    • Facts regarding (unary) predicates associated with entities may contain
      either a fresh or a non-fresh constant.

    • For facts regarding predicates associated with $n$-ary attributes, we
      have that the last position may be occupied by either a fresh or a
      non-fresh constant, and the first $n$ positions behave like a fact for a
      relation in $\mathcal{R}_R$ (i.e., they contain only non-fresh constants,
      only fresh constants, or exactly one non-fresh constant).

    In the base case (no application), we only have facts in $D$, which only
    contain non-fresh constants. Suppose now, by inductive hypothesis, that,
    after $i$ applications of the chase rules, the facts are only of the forms
    mentioned above. The inductive step consists in showing that no new
    application of a chase rule produces facts that are not in one of the forms
    mentioned above. To see this, it suffices to verify this for all forms
    (1)-(11) of dependencies that may occur in CDs, as described in Section 3.
    This is immediate for (1)-(10). As for (11), consider that a KD rule can be
    applied on two tuples $t_1$ and $t_2$ for a relation $r \in \mathcal{R}_R$
    in two cases:

    • $t_1$ and $t_2$ both have in the position of the key the same non-fresh
      constant. In this case the inductive step immediately follows, by either
      a failure of the chase or the generation of a new tuple containing
      exactly one non-fresh constant (possibly occurring more than once).

    • $t_1$ and $t_2$ both have in the position of the key the same fresh
      constant. The inductive step follows immediately, unless $t_1$ contains
      exactly one non-fresh constant, say $c$, and $t_2$ contains exactly one
      non-fresh constant, say $d$, with $d \ne c$, because then the KD rule
      could produce a tuple containing two different non-fresh constants.
      However, this case cannot occur. To see this, it suffices to show that if
      two tuples $t_1$ and $t_2$ for $r \in \mathcal{R}_R$ have a fresh
      constant in common, then they cannot have different non-fresh constants.
      This can, again, be shown by induction on the applications of chase rules
      for dependencies of the forms (1)-(11). Basically, the only way for
      tuples of relations in $\mathcal{R}_R$ to have fresh constants in common
      is to apply chase rules on dependencies of the forms (9)-(11).

      - With form (9), the application of the ID chase rule on a cycle of is-a
        relations between relationships may generate two tuples sharing a fresh
        constant. However, only permutations of the positions can take place,
        but the constants are unchanged.

      - Two applications of the ID chase rule on two different IDs of form (10)
        for the same entity and the same relationship but on two different
        components can generate two tuples sharing a fresh constant. However,
        all the other constants will also be fresh.

      - The application of a KD chase rule for a KD of form (11) is now
        trivially harmless by inductive hypothesis.

(ii) All facts of the form $r(c_1,\ldots,c_n)$, with $r \in \mathcal{R}_R$,
    that contain only non-fresh constants are obtained by applying (possibly
    several times) the ID chase rule for IDs in $\Sigma_R$ to facts in the
    initial database (constituted in turn by tuples containing only non-fresh
    constants).

(iii) By what is stated in point (i) above, the only way of causing a failure
    in the chase construction (apart from violations of key constraints already
    in $D$) is to apply an ID in $\Sigma_R$ to a tuple having only non-fresh
    constants, thus introducing a (non-repairable) violation of some KD due to
    the presence of another tuple having only non-fresh constants; in all other
    cases, every violation of a KD is repaired by applications of the KD chase
    rule.

This said, it follows that if there is no failure in
$\mathit{chase}_{\Sigma_R}(D)$, there is no failure in
$\mathit{chase}_\Sigma(D)$. It remains to check whether
$\mathit{chase}_{\Sigma_R}(D)$ is finite: it is easily seen that it indeed
cannot be infinite, since every tuple in $\mathit{chase}_{\Sigma_R}(D)$ is of
the form $r(c_1,\ldots,c_n)$, with $r \in \mathcal{R}_R$, and where
$c_1,\ldots,c_n$ are obtained by a permutation of $d_1,\ldots,d_n$, where the
fact $r'(d_1,\ldots,d_n)$, with $r' \in \mathcal{R}_R$, is in the initial
database $D$. The maximum depth of $\mathit{chase}_{\Sigma_R}(D)$ is $W!$,
where $W$ is the maximum arity of predicates in $\mathcal{R}$. It is also
straightforward to see that the size of $\mathit{chase}_{\Sigma_R}(D)$ is
polynomial in $|D|$ (size of $D$, i.e., number of tuples of $D$), and that
$\mathit{chase}_{\Sigma_R}(D)$ can be constructed in time polynomial in $|D|$.
By the above considerations, it is immediately seen that
$\mathit{chase}_{\Sigma_R}(D)$ fails iff $\mathit{chase}_\Sigma(D)$ fails. The
thesis follows. □
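
One way to read the argument above as a procedure is the following sketch: saturate the initial facts under the is-a IDs among relationships only (which never introduces fresh constants), then look for two facts that agree on a key position but differ elsewhere. The encoding of an is-a ID as `(r1, perm, r2)`, standing for $r_1[1,\ldots,n] \subseteq r_2[\mathit{perm}]$, and the function name `chase_exists` are assumptions made for this illustration; the initial database is assumed to contain non-fresh constants only.

```python
def chase_exists(db, isa_ids, kds):
    """Saturate db under the is-a IDs among relationships, then test the KDs."""
    facts = set(db)
    changed = True
    while changed:                       # ends: only permutations of existing tuples are added
        changed = False
        for pred, args in list(facts):
            for r1, perm, r2 in isa_ids:
                if pred != r1:
                    continue
                image = [None] * len(args)
                for i, j in enumerate(perm):
                    image[j - 1] = args[i]   # component i+1 of r1 goes to position j of r2
                new = (r2, tuple(image))
                if new not in facts:
                    facts.add(new)
                    changed = True
    for r, key in kds:
        seen = {}
        for pred, args in facts:
            if pred != r:
                continue
            k = tuple(args[i - 1] for i in key)
            if seen.setdefault(k, args) != args:
                return False             # two non-fresh tuples clash on the key
    return True


# Facts r(a, b), s(a, c) with r[1,2] ⊆ s[1,2] and key(s) = {1}: s(a, b) clashes with s(a, c).
print(chase_exists({("r", ("a", "b")), ("s", ("a", "c"))},
                   [("r", (1, 2), "s")], [("s", (1,))]))

# Facts manager(m), works_in(m, d) with manages[1,2] ⊆ works_in[1,2]: no clash.
print(chase_exists({("manager", ("m",)), ("works_in", ("m", "d"))},
                   [("manages", (1, 2), "works_in")],
                   [("works_in", (1,)), ("manages", (1,))]))
```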

Lemma 2 

Let $D$ be a database for a relational schema $\mathcal{R}$ and $\Sigma$ a set
of CDs over $\mathcal{R}$ such that $\mathit{chase}_\Sigma(D)$ exists. Then
$\mathit{chase}_\Sigma(D) \models \Sigma$.

Proof

Trivial, by the construction of Definition 5. □

The following lemma is a technical result that will be used in the proof of
Theorem 1. Informally, it shows that the chase of a database $D$, when it
exists, is a powerful tool for answering queries: for every solution $B$
(database that is a superset of the given incomplete database $D$ and that
satisfies the constraints), there is a homomorphism that sends the chase of $D$
onto $B$. This result follows from the results in (Fagin et al. 2005; Deutsch
et al. 2008), but we provide a direct proof for the sake of completeness.

Lemma 3

Let $D$ be a database for a relational schema $\mathcal{R}$ and $\Sigma$ a set
of CDs over $\mathcal{R}$ such that $\mathit{chase}_\Sigma(D)$ exists. Then,
for every database $B$ for $\mathcal{R}$ such that $B \models \Sigma$ and
$B \supseteq D$, we have that there exists a homomorphism from
$\mathit{chase}_\Sigma(D)$ to $B$.

Proof 

Similarly to what is done for the analogous result in (Calì et al. 2004), we
proceed by induction on the applications of the (ID or KD) chase rules. We
define a homomorphism $\mu$ inductively, and we simultaneously show that for
each relation $r$ of arity $n$ in $\mathcal{R}$, and each tuple
$(c_1,\ldots,c_n)$ constituted by elements in $\Gamma \cup \Gamma_f$, if
$(c_1,\ldots,c_n) \in r^{\mathit{chase}_\Sigma(D)}$, then
$(\mu(c_1),\ldots,\mu(c_n)) \in r^B$.

(1) Base case. After 0 applications of a chase rule, the constructed part of
the chase coincides with $D$. Since $B \supseteq D$, the mapping $\mu$ that
maps each constant in $D$ into itself is a homomorphism from the constructed
part of the chase to $B$.

(2) Inductive step. First case: the applied rule is the ID chase rule. Suppose
that in the application of the rule, we are inserting the tuple
$t^* = (\alpha_1,\ldots,\alpha_n)$ in $r^{\mathit{chase}_\Sigma(D)}$, where $r$
has arity $n$, $\alpha_i \in \Gamma_f$ for each $i \ne k$,
$\alpha_k \in \Gamma \cup \Gamma_f$, and the tuple is inserted in
$r^{\mathit{chase}_\Sigma(D)}$ because of the ID $w[j] \subseteq r[k]$ (other
forms of IDs among those described in points (1)-(11) in Section 3 are dealt
with similarly). Since we are applying the rule because of the dependency
$w[j] \subseteq r[k]$, there is a tuple $t$ in $w^{\mathit{chase}_\Sigma(D)}$
such that $t[j] = \alpha_k$. By inductive hypothesis, there is a constant $c_k$
in $\Gamma$ such that $\mu(\alpha_k) = c_k$, and there is a tuple
$t' \in w^B$ such that for each $i$, $t'[i] = \mu(t[i])$, with
$t'[j] = \mu(\alpha_k) = c_k$. Because of the constraint
$w[j] \subseteq r[k]$, and because $B$ satisfies the constraints, there is a
tuple $t''$ in $r^B$ with $t''[k] = c_k$; let then $t'' = (c_1,\ldots,c_n)$.
Then, we set $\mu(\alpha_i) = c_i$ for each $i \ne k$, and we can conclude that
$\mu(t^*) \in r^B$.

Second case: the applied rule is the KD chase rule. By inductive hypothesis,
there exists a homomorphism $\mu$ mapping the two tuples $t, t'$ on which the
KD rule is applied into tuples $\mu(t)$ and $\mu(t')$ in $B$. Note that, since
the KD rule is applicable to $t, t'$ and $B \models \Sigma$, we must have
$\mu(t) = \mu(t')$. In the chase, $t$ and $t'$ are then replaced by a new
tuple, say $t''$, that contains (in the same positions) all the non-fresh
constants of $t, t'$ and a subset of the fresh constants of $t, t'$ (some of
which may disappear by the KD chase rule), but no new fresh constant.
Therefore, $\mu$ trivially also maps $t''$, as well as all other tuples in the
chase, into facts of $B$. □

The following theorem is the main result of this section, and it characterizes
the chase as a formal tool for query answering under KDs and IDs. In
particular, the theorem states that the answers to a query $q$, posed on an
incomplete database $D$ under a set $\Sigma$ of CDs, can be obtained by
evaluating $q$ over the chase of $D$ w.r.t. $\Sigma$,
$\mathit{chase}_\Sigma(D)$, and discarding the result tuples that contain at
least one fresh value.

Theorem 1

Let $D$ be a database for a relational schema $\mathcal{R}$ and $\Sigma$ a set
of CDs over $\mathcal{R}$ such that $\mathit{chase}_\Sigma(D)$ exists. Then,
for every conjunctive query $q$ over $\mathcal{R}$, the set of tuples of
$q(\mathit{chase}_\Sigma(D))$ that contain no fresh constant coincides with
$\mathit{ans}(q,\Sigma,D)$.

Proof 

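The theorem is proved by considering a generic database $B$ such that
$B \models \Sigma$ and $B \supseteq D$.

By Lemma 3 we derive the existence of a homomorphism $\mu$ that sends the facts
of $\mathit{chase}_\Sigma(D)$ to facts of $B$; if
$t \in q(\mathit{chase}_\Sigma(D))$, there is a homomorphism $\lambda$ from the
atoms of $\mathit{body}(q)$ to $\mathit{chase}_\Sigma(D)$ that sends
$\mathit{head}(q)$ to $t$; therefore, the composition $\lambda \circ \mu$ is a
homomorphism from the atoms of $\mathit{body}(q)$ to $B$ that sends
$\mathit{head}(q)$ to $t$, which proves
$q(\mathit{chase}_\Sigma(D)) \subseteq \mathit{ans}(q,\Sigma,D)$ and, a
fortiori, that every tuple of $q(\mathit{chase}_\Sigma(D))$ without fresh
constants belongs to $\mathit{ans}(q,\Sigma,D)$.

For the other inclusion, consider that $\mathit{chase}_\Sigma(D) \supseteq D$
and $\mathit{chase}_\Sigma(D) \models \Sigma$. Then, by Definition 1, we have
that a tuple $t$ is a certain answer to $q$ in $D$ under $\Sigma$ only if it is
an answer to $q$ in $\mathit{chase}_\Sigma(D)$ with no fresh constant; hence
every tuple of $\mathit{ans}(q,\Sigma,D)$ is a tuple of
$q(\mathit{chase}_\Sigma(D))$ without fresh constants. □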

Notice that Theorem 1 does not lead to an algorithm for query answering (apart
from special cases), since the chase may have infinite size.
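
In the special cases where the chase is finite, Theorem 1 can be applied directly: evaluate the conjunctive query over the chase and drop the tuples that mention a fresh constant. The sketch below illustrates this under assumptions made here: facts are `(predicate, tuple)` pairs, query variables are strings starting with an uppercase letter, and fresh constants are recognized by a caller-supplied predicate `is_fresh`. The function name `evaluate` is hypothetical.

```python
def evaluate(head, body, facts, is_fresh):
    """Answers of the CQ head <- body over a finite fact set, without fresh constants."""
    def is_var(term):
        return isinstance(term, str) and term[:1].isupper()

    def extend(atoms, binding):
        # Backtracking search for homomorphisms from the body atoms to the facts.
        if not atoms:
            yield binding
            return
        (pred, terms), rest = atoms[0], atoms[1:]
        for fpred, args in facts:
            if fpred != pred or len(args) != len(terms):
                continue
            new = dict(binding)
            if all(new.setdefault(t, a) == a if is_var(t) else t == a
                   for t, a in zip(terms, args)):
                yield from extend(rest, new)

    answers = {tuple(b[v] for v in head) for b in extend(list(body), {})}
    return {t for t in answers if not any(is_fresh(c) for c in t)}


def fresh(c):
    return isinstance(c, int)            # assumption: fresh constants are ints


# A finite chase: employee e works in some unknown department, represented by fresh value 1.
facts = {("employee", ("e",)), ("works_in", ("e", 1)), ("dept", (1,)),
         ("manager", ("m",)), ("manages", ("m", "d")), ("works_in", ("m", "d")),
         ("employee", ("m",)), ("dept", ("d",))}

# q(X) <- works_in(X, Y), dept(Y): both e and m certainly work somewhere.
print(evaluate(("X",), [("works_in", ("X", "Y")), ("dept", ("Y",))], facts, fresh))

# q(Y) <- works_in(X, Y): only d is a certain answer; the fresh value is discarded.
print(evaluate(("Y",), [("works_in", ("X", "Y"))], facts, fresh))
```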

5 Answering Queries by Rewriting

In this section we present an efficient technique for query answering on
incomplete data in the presence of CDs; such technique is based on query
rewriting; in particular, the answers to a query are obtained by evaluating a
new query, obtained by rewriting the original one according to the
dependencies, over the initial incomplete data.

For the sake of simplicity, in the remainder of this section we shall disregard
attributes from our treatment, since attributes are acyclic and therefore can
be added without changing the results.

5.1 Query rewriting 

Query answering under CDs can be decided by checking an initial segment of the
chase of a database. We show that the certain answers to a CQ $q$ over a database $D$
can be computed by evaluating $q$ over the initial segment of the chase of $D$, whose
size, defined by a maximum level $\delta_M$, depends on the query, on the dependencies,
and on the size $\lambda_D$ of the largest connected part of the join graph of database $D$.
The join graph of a database $D$ is an undirected graph that has as nodes the atoms
of $D$ and has an arc $(A, B)$ iff $A$ and $B$ share a constant.
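
The definition of the join graph can be illustrated with a small sketch. The code below is only an illustration under stated assumptions: facts are represented as hypothetical `(predicate, args)` pairs, and the size of a connected part is assumed to be counted as its number of atoms. The function name `largest_connected_part` is not from the paper.

```python
from collections import defaultdict

def largest_connected_part(facts):
    """Return the number of atoms in the largest connected part of the join graph.

    `facts` is a list of (predicate, args) pairs; two atoms are adjacent
    iff they share at least one constant.
    """
    # Union-find over atom indices.
    parent = list(range(len(facts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every atom to the first atom seen with the same constant.
    first_seen = {}
    for i, (_, args) in enumerate(facts):
        for c in args:
            if c in first_seen:
                parent[find(i)] = find(first_seen[c])
            else:
                first_seen[c] = i

    sizes = defaultdict(int)
    for i in range(len(facts)):
        sizes[find(i)] += 1
    return max(sizes.values(), default=0)

# manager(m) and works_in(m, d) share the constant m; dept(e) is isolated.
D = [("manager", ("m",)), ("works_in", ("m", "d")), ("dept", ("e",))]
lambda_D = largest_connected_part(D)
```

Union-find keeps the computation close to linear in the number of constant occurrences, which matters only because the value is computed over the whole database.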

Theorem 2 

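Let $\mathcal{R}$ be a relational schema, $\Sigma$ a set of CDs over $\mathcal{R}$, $q$ a conjunctive query over
$\mathcal{R}$, and $D$ a database for which $\mathit{chase}_\Sigma(D)$ exists. Then, there is a number $\delta_M$ that
depends on $q$, $\Sigma$, $\mathcal{R}$, and $\lambda_D$ such that for every tuple $t \in q^{[\Gamma]}(\mathit{chase}_\Sigma(D))$, there
exists a homomorphism $\mu$ sending $\mathit{body}(q)$ to facts of $\mathit{chase}_\Sigma(D)$ and $\mathit{head}(q)$ to $t$
such that all the atoms in $\mu(\mathit{body}(q))$ are in the first $\delta_M$ levels of $\mathit{chase}_\Sigma(D)$.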

Proof 

First of all, we introduce the chase forest for $\mathit{chase}_\Sigma(D)$ given a database $D$ and
a set of CDs $\Sigma$. The nodes of the forest are the atoms in $\mathit{chase}_\Sigma(D)$, and there is
an arc $(A_1, A_2)$ iff $A_2$ is generated from $A_1$ by an application of the ID chase rule.
The roots in the forest are the atoms in $D$, and they are at level 0. If there is an
arc $(A_1, A_2)$ and $A_1$ is at level $\ell$, then $A_2$ is at level $\ell + 1$. In order to carry on the
proof, we now prove that a constant can be propagated in the chase for at most a
fixed number of levels that does not depend on $D$.

Lemma 4 

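Let $D$ be a database for a relational schema $\mathcal{R}$, $\Sigma$ a set of CDs over $\mathcal{R}$ such that
$\mathit{chase}_\Sigma(D)$ exists, and $q$ a conjunctive query over $\mathcal{R}$. Let $a$ be a constant in $\Gamma$
occurring in an atom in $D$. Then $a$ never occurs in any fact with level greater than
$\delta_D = \delta_C \cdot \lambda_D$ in $\mathit{chase}_\Sigma(D)$, where $\delta_C = |\mathcal{R}| \cdot (1 + |\mathcal{R}| \cdot W!)$.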

Proof 

We start by considering the IDs. First, observe that, in a set of CDs, the only
non-unary IDs in $\Sigma$ are the IDs encoding is-a arcs between relationships (which are
full-width IDs) and the IDs regarding attributes of a relationship. Clearly, $a$ can
be propagated to other atoms by applications of an ID chase rule, starting from
the atom of $D$ in which it occurs, then from the atom generated from it by the
application, and so forth. The propagation can be done for up to $|\Sigma|$ more levels if
there are no cycles in the IDs, but also for more, if there are cycles.

Whenever there is an application of an $n$-ary ID ($n \geq 2$) on an atom $A$, the
generated atom $A'$ contains a permutation of the constants occurring in $A$; both
the involved predicates have the same arity $n$ (except in the case of an ID regarding
attributes of a relationship, where one predicate has arity $n + 1$, but the $(n + 1)$-th
argument is never used in the IDs). Then, a sequence of consecutive applications
of $n$-ary IDs can go on for at most $n! \cdot |\mathcal{R}|$ levels, since there are $n!$ possible
permutations of the constants in $A$ and there are at most $|\mathcal{R}|$ relations involved in
$n$-ary IDs. All constants occurring in $A$ (except at most the last one, if $A$ regards
an attribute of a relationship) are propagated throughout the sequence.

All other applications regard unary IDs. At least one of the two predicates
involved in a unary ID must be unary, and the only way to retain $a$ in a unary atom
is that it be of the form $e(a)$, where $e$ is a unary predicate; clearly such a fact can be
generated only once in the chase, and there are at most $|\mathcal{R}|$ unary predicates in $\mathcal{R}$.

Any path in the chase starting from the atom of $D$ in which $a$ occurs consists of
sequences of consecutive applications of $n$-ary IDs ($n \geq 2$) interleaved by
applications of unary IDs. According to the previous considerations, there can be
at most $|\mathcal{R}| + 1$ sequences of consecutive applications of $n$-ary IDs (with $n \geq 2$ and
$n \leq W$). Given the maximum lengths of such sequences, $a$ can be propagated for at
most $\delta_C = |\mathcal{R}| \cdot (1 + |\mathcal{R}| \cdot W!)$ levels.

We now consider the KDs. To prove the claim, we first state the following lemma. 

Lemma 5 

Let $A$ be the first atom (of the form $r(\ldots, z_0, \ldots)$, where $r$ is $n$-ary, $n \geq 2$) in
which a constant $z_0 \in \Gamma \cup \Gamma_f$ occurs, with $\ell = \mathit{level}(A) > \delta_C$. Let $B$ be the closest
predecessor of atom $A$ of the form $e(w_0)$ ($e$ unary). Let $B'$ be an atom of the form
$e(z_1)$, $z_1 \in \Gamma \cup \Gamma_f$, with $\mathit{level}(B') > \ell + \delta_C$ such that there is an atom $C$ of the
form $e'(z_0)$ ($e'$ unary) in the path between $A$ and $B'$. Then no constant occurring
in $A$ other than $z_0$ occurs in any of the descendants of $B'$.

Proof 

Atom $C$ may well have a child (or a descendant obtained by consecutive applications
of the ID chase rule for non-unary IDs from the child) $D$ of the form $r'(\ldots, z_1, \ldots)$
such that it agrees on the key of $r'$ (on value $z_1$) with some descendant $D'$ of $B'$
of the same form, so that the constants in $D$ (possibly including $z_0$) will replace
the corresponding constants of $D'$ in all the descendants of $D'$. Note that $B'$ is
necessarily a descendant of $C$ with the same constants as $D$. This shows that $z_0$
may well occur in some descendant of $B'$. Let us indicate with $z_0'$ the constant that is
replaced by $z_0$ after the application of the KD chase rule. Assume, by contradiction,
that one of the constants in $A$ other than $z_0$ occurs in some descendant of $B'$.
Then, there must be a descendant $A'$ of $B'$ of the form $r(\ldots, z_0', \ldots)$ that, once
$z_0'$ is replaced by $z_0$, fires the application of a KD chase rule between $A$ and $A'$.
There are two cases: (i) $B'$ generates $D'$ via a sequence of non-unary IDs. Then
$z_1$ is replaced by $w_0$, so the subtree rooted in $B'$ gets to have the same root as
the subtree rooted in $B$ and therefore it disappears as a consequence of the KD
application. (ii) $A'$ is a descendant of $C'$ along a path that contains at least an
application of the ID chase rule for a unary ID, where $C'$ is obtained from $B'$ by
the same sequence of applications of ID chase rules as those generating $C$ from $B$.
Again, the KD chase rule makes $C'$ become equal to $C$, therefore the whole subtree
rooted in $C'$ disappears, as is easily seen, as above. □

Consider the proof of Lemma 5 and assume $z_0 \in \Gamma$. Then, after at most $\delta_C$ levels
$z_0$ will not appear together with any of the other constants in $A$. Also, $z_0$ cannot
be propagated indefinitely in the chase by applications of ID chase rules, since this
requires using $z_0$ with a unary predicate, which can be done only once per unary
predicate. However, if $z_0$ appears in an atom in $D$ together with another constant
$c$, then $c$ could appear together with $z_0$ in a descendant of $B'$, and propagate
through further $\delta_C$ levels. By the same principle, this can go on for every sequence
of constants $c_1, \ldots, c_n$ such that $c_i$ occurs in the same atom in $D$ together with $c_{i+1}$.
Since the maximum sequence of this kind can have length $\lambda_D$, and the sequences
in $D$ are not altered by the chase construction, the claim follows. □

Lemma 4 is the key property for stopping the construction of the chase at a given
level $\delta_M$ without altering query answering. We first prove the claim for the simple
but important subclass of conjunctive queries called non-boolean (i.e., with at least
one distinguished variable) connected queries. A set of atoms $\mathcal{N}$ is connected if the
undirected graph $(\mathcal{N}, \mathcal{A})$ is connected, where $\mathcal{N}$ is the set of nodes, and $\mathcal{A}$ is the
set containing exactly all arcs between any two atoms in $\mathcal{N}$ that share a variable
or a constant. A CQ $q$ is connected if $\mathit{body}(q)$ is. Every maximal subset of $\mathit{body}(q)$
that is connected is called a connected part of $q$. Assume $\mu$ is a homomorphism
sending $\mathit{head}(q)$ to a non-empty tuple $t$ of constants in $\Gamma$ and $\mathit{body}(q)$ to atoms of
$\mathit{chase}_\Sigma(D)$. Since the query has at least one distinguished variable, there is at
least one atom $A$ in $\mathit{body}(q)$ such that $\mu(A)$ contains a constant $c_1$ of $t$, which is
therefore in $\Gamma$. By Lemma 4 the constants in $\Gamma$ cannot occur at levels greater than
$\delta_D$; then $\mathit{level}(\mu(A)) \leq \delta_D$. If a query is connected and non-boolean, then among the
other body atoms there is at least another atom $A'$ sharing a variable with $A$, and thus
such that $\mu(A')$ shares a constant with $\mu(A)$. Note now that $\mu(A)$ contains $c_1$ plus
possibly other constants. If such constants are in $\Gamma$, then also $\mu(A')$ has a level at
most $\delta_D$. Else, they are all fresh and have been created in the subtree rooted in the
closest unary predecessor $B$ of $\mu(A)$; $B$ has the form $e_1(c_1)$. Now we show that all
the constants different from $c_1$ (say, $z_1, \ldots, z_n$) in $\mu(A)$ occur within the first $\delta_C$
levels from $\mu(A)$, and therefore $\mu(A')$ also occurs at a level at most
$\mathit{level}(\mu(A)) + \delta_C$. To see this, we simply reapply Lemma 5 by considering $\mu(A)$ alone
as the starting database for the subsequent propagation of constants. Indeed, for
$1 \leq i, j \leq n$, the longest path from an atom containing $z_i$ (but not $z_j$) to an atom
containing $z_j$ (but not $z_i$) in the join graph is 1. This process can be iterated for all
the remaining atoms in the query. Since the size of the longest path in the graph of
$q$ is at most $|q| - 1$, it follows that all the images of the atoms of the query are in
the first $\delta_M = \delta_D + \delta_C \cdot (|q| - 1)$ levels.

If the query is not connected, but each connected part is non-boolean, the same
argument as before applies to each connected part, with the same final $\delta_M$.

If the query has at least a boolean connected part, we can reason as follows.
Let $A$ be the atom in the connected part whose image $\mu(A)$ is at the lowest level
among the query atoms. If $\mathit{level}(\mu(A)) > \delta_D$, then there is another homomorphism
$\mu'$ sending $\mathit{body}(q)$ to atoms of $\mathit{chase}_\Sigma(D)$ such that $\mathit{level}(\mu'(A)) \leq \delta_D$, because
all types occur within the first $\delta_D$ levels, where two atoms have the same type if
they share the same predicate and agree on all the positions where a constant of $\Gamma$
appears. With the same argument as before, all the images via $\mu'$ are at a level at
most $\delta_M$. □

The previous theorem suggests a naive strategy for query answering: first,
compute the initial segment of $\mathit{chase}_\Sigma(D)$, i.e., its first $\delta_M$ levels, and then evaluate the
query $q$ on such a segment. To do that, we also need the following lemma.

Lemma 6 

Consider the application of a KD chase rule on two atoms $A_1$ and $A_2$ with
$\mathit{level}(A_1) = \ell_1 > \delta_D$ and $\mathit{level}(A_2) = \ell_2 > \delta_D$. Consider also all subsequent
applications of KD chase rules before the next application of an ID chase rule. Then,
after all these applications, no atom in the chase is affected that has level lower
than $\min\{\ell_1, \ell_2\} - \delta_C$.

Proof 

By definition of the chase, when a KD chase rule is applied, the affected constants
are the more recent ones in the chase construction. Then, it easily follows that they
may only occur at most $\delta_C$ levels before $\min\{\ell_1, \ell_2\}$. Indeed, $A_1$ and $A_2$ have at
least a constant in common. Two cases are possible: (i) they share a constant in
$\Gamma$, therefore they may only occur within the first $\delta_D$ levels by Lemma 4, against
the hypotheses; (ii) they share a constant in $\Gamma_f$. In the latter case, they have a
common unary predecessor $A_0$ within $\delta_C$ levels before $\min\{\ell_1, \ell_2\}$. In this case, the
replacement of constants has an impact only on the subtree $T$ rooted in $A_0$, since
all other constants in $T$ are by construction newer than the one occurring in $A_0$.
The same holds for the subsequent applications. □

By Lemma 6 it is immediate to see that the application of the KD chase rule does
not affect any facts whose depth is smaller by at least $\delta_C$ levels than the level of
the facts involved in the KD; therefore, to compute the first $\delta_M$ levels of $\mathit{chase}_\Sigma(D)$
means to apply the chase rules of Definition 5 until no chase rule is applicable on
facts at a level smaller than $\delta_M + \delta_C$. However, it is easy to see that such a strategy
would not be efficient in real-world cases, where $D$ has a large size. Our plan of
attack is then to rewrite $q$ according to the CDs on the schema and on the size $\lambda_D$
of the largest connected part of the join graph, and then to evaluate the rewritten
query over the initial data. This turns out to be more efficient in practice, if $\lambda_D$
is bounded or known to be reasonably small, since it does not involve the entire
database $D$ in the query processing, except for the last evaluation step, so most of
the computation is kept at the intensional level. In particular, the rewritten query
is expressed in Datalog, and it is the union of two sets of rules, denoted $\Pi^{\Sigma_I}$ and
$\Pi^{\Sigma_K}$, that take into account IDs and KDs respectively, plus a set of rules $\Pi^{eq}$ that
simulates equality. Finally, function symbols present in the rules will be eliminated
to obtain a Datalog rewriting.

Consider a relational schema $\mathcal{R}$ with a set $\Sigma$ of CDs, with $\Sigma = \Sigma_I \cup \Sigma_K$, where $\Sigma_I$
and $\Sigma_K$ are sets of IDs and KDs respectively. Let $q$ be a CQ over $\mathcal{R}$; we construct
$\Pi^{eq}$, $\Pi^{\Sigma_K}$ and $\Pi^{\Sigma_I}$ in the following way.

Encoding equalities. We introduce a binary predicate $\mathit{eq}/2$ that simulates the
equality predicate; to enforce reflexivity, symmetry and transitivity respectively, we
introduce in $\Pi^{eq}$ the rules

(a) $\mathit{eq}(X_i, X_i) \leftarrow r(X_1, \ldots, X_n)$ for all $r/n$ in $\mathcal{R}$ and for all $i \in \{1, \ldots, n\}$

(b) $\mathit{eq}(Y, X) \leftarrow \mathit{eq}(X, Y)$

(c) $\mathit{eq}(X, Z) \leftarrow \mathit{eq}(X, Y), \mathit{eq}(Y, Z)$

Similar rules for encoding equalities are found, for instance,
in (Duschka and Levy 1997; Gottlob and Nash 2008).

Encoding key dependencies. For every KD $\mathit{key}(r) = \{k\}$ (notice from Section 3
that in the case of CDs all keys are unary if the original EER schema contains
no attributes), with $r$ of arity $n$, we introduce in $\Pi^{\Sigma_K}$ the rule

$\mathit{eq}(X_i, Y_i) \leftarrow r(X_1, \ldots, X_{k-1}, X_k, X_{k+1}, \ldots, X_n),$
$\qquad r(Y_1, \ldots, Y_{k-1}, Y_k, Y_{k+1}, \ldots, Y_n), \mathit{eq}(X_k, Y_k)$

for all $i$ s.t. $1 \leq i \leq n$, $i \neq k$.
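
To make the interplay between the equality rules and the key rules concrete, the following sketch computes their joint fixpoint naively over a set of ground facts. It is an illustration only, not the evaluation strategy of the paper: the representation of facts as `(predicate, args)` tuples, the `keys` dictionary and the function name `eq_closure` are assumptions, and only unary keys are handled.

```python
def eq_closure(facts, keys):
    """Naive fixpoint of rules (a)-(c) plus the KD rules for unary keys.

    facts: set of (predicate, args) tuples, args a tuple of constants.
    keys:  dict mapping a predicate to its 1-based key position k.
    Returns the derived eq relation as a set of (x, y) pairs.
    """
    # Rule (a): reflexivity on every constant occurring in a fact.
    eq = {(c, c) for _, args in facts for c in args}
    changed = True
    while changed:
        new = set()
        # Rule (b): symmetry.
        new |= {(y, x) for (x, y) in eq}
        # Rule (c): transitivity.
        new |= {(x, z) for (x, y) in eq for (y2, z) in eq if y == y2}
        # KD rules: equal key values force the other positions to be equal.
        for p, a in facts:
            for p2, b in facts:
                if p != p2 or p not in keys:
                    continue
                k = keys[p] - 1
                if (a[k], b[k]) in eq:
                    new |= {(a[i], b[i]) for i in range(len(a)) if i != k}
        changed = not new <= eq
        eq |= new
    return eq

# Two works_in facts agreeing on the key (position 1) force d and a1 to be equal.
facts = {("works_in", ("m", "d")), ("works_in", ("m", "a1"))}
eq = eq_closure(facts, {"works_in": 1})
```

A real Datalog engine would evaluate the same rules semi-naively; the quadratic loops here are kept only for readability.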

Encoding inclusion dependencies. The encoding of a set $\Sigma_I$ of IDs into a set
of rules is done in two steps. Similarly to (Calì et al. 200H; Calì 2003), every
ID is encoded by a logic programming rule in $\Pi^{\Sigma_I}$ with function symbols, appearing
in Skolem terms that replace existentially quantified variables in the head of the
rules; intuitively, they mimic the fresh constants that are added in the construction
of the chase. We consider the four cases that are possible for an ID $\sigma$ in a set of
CDs coming from an EER schema without attributes:

(1) $\sigma$ is of the form $r_1[1] \subseteq r_2[1]$, with $r_1/1$, $r_2/1$: we add to $\Pi^{\Sigma_I}$ the rule
$r_2(X) \leftarrow r_1(X)$.

(2) $\sigma$ is of the form $r_1[k] \subseteq r_2[1]$, with $r_1/n$, $r_2/1$, $1 \leq k \leq n$: we add to $\Pi^{\Sigma_I}$ the
rule $r_2(X_k) \leftarrow r_1(X_1, \ldots, X_n)$.

(3) $\sigma$ is of the form $r_1[1, \ldots, n] \subseteq r_2[j_1, \ldots, j_n]$, with $r_1/n$, $r_2/n$, and where
$(j_1, \ldots, j_n)$ is a permutation of $(1, \ldots, n)$: we add to $\Pi^{\Sigma_I}$ the rule
$r_2(X_{j_1}, \ldots, X_{j_n}) \leftarrow r_1(X_1, \ldots, X_n)$.

(4) $\sigma$ is of the form $r_1[1] \subseteq r_2[k]$, with $r_1/1$, $r_2/n$, $1 \leq k \leq n$: we add to $\Pi^{\Sigma_I}$ the
rule $r_2(f_{\sigma,1}(X), \ldots, f_{\sigma,k-1}(X), X, f_{\sigma,k+1}(X), \ldots, f_{\sigma,n}(X)) \leftarrow r_1(X)$.

Note that in (4) we have used subscripts of the form $\sigma,j$ so as to indicate that for
every dependency and for every attribute of $r_2$ there is a different function symbol.
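
The four cases above can be generated mechanically. The sketch below is a hedged illustration: it emits rules as strings in a Prolog-like syntax (`:-` for the arrow), the helper name `encode_id` and the naming scheme `f_<label>_<j>` for Skolem function symbols are assumptions made here, and positions are 1-based as in the text. Each variable of $r_1$ selected by the ID is placed at the corresponding position of $r_2$, and every remaining position of $r_2$ receives a Skolem term.

```python
def encode_id(label, r1, n1, pos1, r2, n2, pos2):
    """Encode the ID r1[pos1] <= r2[pos2] as a rule string.

    label: name of the dependency (used to build Skolem function symbols).
    n1, n2: arities of r1 and r2; pos1, pos2: lists of 1-based positions.
    """
    body_vars = [f"X{i}" for i in range(1, n1 + 1)]
    # Variables propagated from r1 to r2, in the order given by the ID.
    shared = [body_vars[p - 1] for p in pos1]
    head_args = []
    for j in range(1, n2 + 1):
        if j in pos2:
            head_args.append(shared[pos2.index(j)])
        else:
            # Skolem term standing for the fresh constant at position j.
            head_args.append(f"f_{label}_{j}({', '.join(shared)})")
    return f"{r2}({', '.join(head_args)}) :- {r1}({', '.join(body_vars)})."

# Case (2): works_in[1] <= employee[1]
rule_proj = encode_id("s4", "works_in", 2, [1], "employee", 1, [1])
# Case (3): manages[1,2] <= works_in[1,2]
rule_full = encode_id("s9", "manages", 2, [1, 2], "works_in", 2, [1, 2])
# Case (4): employee[1] <= works_in[1], position 2 is existential
rule_skolem = encode_id("s10", "employee", 1, [1], "works_in", 2, [1])
```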

Example 6 

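Consider the dependencies that do not involve attributes ($\sigma_4$ to $\sigma_{13}$) from Example 2.
They can be encoded as follows.

$\sigma_4$: $\mathit{employee}(X) \leftarrow \mathit{works\_in}(X, Y)$
$\sigma_5$: $\mathit{dept}(Y) \leftarrow \mathit{works\_in}(X, Y)$
$\sigma_6$: $\mathit{manager}(X) \leftarrow \mathit{manages}(X, Y)$
$\sigma_7$: $\mathit{dept}(Y) \leftarrow \mathit{manages}(X, Y)$
$\sigma_8$: $\mathit{employee}(X) \leftarrow \mathit{manager}(X)$
$\sigma_9$: $\mathit{works\_in}(X, Y) \leftarrow \mathit{manages}(X, Y)$
$\sigma_{10}$: $\mathit{works\_in}(X, f_{\sigma_{10},2}(X)) \leftarrow \mathit{employee}(X)$
$\sigma_{11}$: $\mathit{manages}(X, f_{\sigma_{11},2}(X)) \leftarrow \mathit{manager}(X)$
$\sigma_{12}$: $\mathit{eq}(Y_1, Y_2) \leftarrow \mathit{works\_in}(X_1, Y_1), \mathit{works\_in}(X_2, Y_2), \mathit{eq}(X_1, X_2)$
$\sigma_{13}$: $\mathit{eq}(Y_1, Y_2) \leftarrow \mathit{manages}(X_1, Y_1), \mathit{manages}(X_2, Y_2), \mathit{eq}(X_1, X_2)$
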

Query maquillage. Since we need to deal with equalities among values in
a uniform way, we need some maquillage (that we call equality maquillage)
on $q$: replace every occurrence of a term $t$ in $\mathit{body}(q)$ with a new variable $X$ not
occurring elsewhere in $q$, and add (as a conjunct) to $\mathit{body}(q)$ the atom $\mathit{eq}(X, t)$.
Henceforth, we shall denote with $q_{eq}$ the query after the equality maquillage.
For example, the query $q(X) \leftarrow r(X, c, Y), s(Y)$ becomes
$q(X) \leftarrow r(A, B, C), s(D), \mathit{eq}(A, X), \mathit{eq}(B, c), \mathit{eq}(C, Y), \mathit{eq}(D, Y)$.
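
The equality maquillage is a purely syntactic transformation, as the following sketch suggests. It is an illustration under assumptions: atoms are hypothetical `(predicate, args)` pairs, fresh variables are named `V1, V2, ...` (assumed not to occur in the query), and the function name is not from the paper.

```python
from itertools import count

def equality_maquillage(head, body):
    """Apply the equality maquillage to a conjunctive query.

    head: (predicate, args); body: list of (predicate, args).
    Each term occurrence in the body is replaced by a fresh variable V
    and an atom eq(V, term) is appended to the body.
    """
    fresh = (f"V{i}" for i in count(1))
    new_body, eq_atoms = [], []
    for pred, args in body:
        new_args = []
        for term in args:
            v = next(fresh)
            new_args.append(v)
            eq_atoms.append(("eq", (v, term)))
        new_body.append((pred, tuple(new_args)))
    return head, new_body + eq_atoms

# q(X) <- r(X, c, Y), s(Y)
q_head = ("q", ("X",))
q_body = [("r", ("X", "c", "Y")), ("s", ("Y",))]
head_eq, body_eq = equality_maquillage(q_head, q_body)
```

On the query of the example above, the result has the same shape as the rewritten query in the text, with `V1, ..., V4` playing the role of $A, \ldots, D$.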

We shall now state that the encoding of CDs by means of the above rules captures
the correct manipulation of facts that is done in the chase (which, we remind the
reader, represents the inference of information done starting from the initial data
and the CDs, under the sound semantics). In order to do that, in Theorem 3 below,
we first need to introduce a few auxiliary constructions and lemmata.

We introduce a variant of the chase with equality predicates, denoted
$\mathit{chase}^{eq}_\Sigma(D)$, that is built as follows from a database $D$ and a set of CDs $\Sigma$.

(1) Add all atoms of the form $\mathit{eq}(c, c)$, at level 0, where $c$ is a constant occurring
in $D$.

(2) Include all the facts in $D$ and proceed as for $\mathit{chase}_\Sigma(D)$, but

(a) A KD is applicable if there is a key constraint $\mathit{key}(r) = \{k_1, \ldots, k_n\}$
and the chase result constructed so far contains the facts $r(t)$, $r(t')$, and
$\mathit{eq}(\alpha_1, \beta_1), \ldots, \mathit{eq}(\alpha_n, \beta_n)$, with $\alpha_i = t[k_i]$ and $\beta_i = t'[k_i]$. When applying
the KD rule, instead of merging tuples by replacing the two constants $\alpha_i$
and $\beta_i$ by a combined symbol, add the atoms $\mathit{eq}(\alpha_i, \beta_i)$, $\mathit{eq}(\beta_i, \alpha_i)$ and all
the $\mathit{eq}$ atoms that can be derived from the existing ones by transitivity; the
level of these $\mathit{eq}$ atoms is the same as the lower of the two facts that fired
the rule.

(b) An ID rule is applicable if there is an ID $r[k_1, \ldots, k_n] \subseteq s[j_1, \ldots, j_n]$ such
that the chase result constructed so far contains the fact $r(t)$ but there is
no fact $s(t')$ such that, for every $i$ such that $1 \leq i \leq n$, $\mathit{eq}(t[k_i], t'[j_i])$ is
in the chase result constructed so far. When applying the ID rule, add the
atom $\mathit{eq}(\alpha, \alpha)$ for each new fresh constant $\alpha$ in the newly introduced fact;
the level of $\mathit{eq}(\alpha, \alpha)$ is the same as the level of the new fact.

(c) Whenever an atom of the form $\mathit{eq}(c_1, c_2)$ is added, where $c_1, c_2 \in \Gamma$, and
$c_1 \neq c_2$, stop the chase procedure (the chase fails).

Example 7 

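Consider again the EER schema of Example 5 and the initial (incomplete) database
$D = \{\mathit{manager}(m), \mathit{works\_in}(m, d)\}$ given in ExampleHl. Then $\mathit{chase}^{eq}_\Sigma(D)$ consists
of $D$ plus the following facts:

• $\mathit{eq}(m, m)$, $\mathit{eq}(d, d)$ (constants at level 0)

• $\mathit{employee}(m)$, $\mathit{manages}(m, \alpha_1)$, $\mathit{works\_in}(m, \alpha_1)$, $\mathit{dept}(\alpha_1)$, where $\alpha_1$ is a fresh
constant (applications of ID chase rules)

• $\mathit{eq}(\alpha_1, \alpha_1)$ (new fresh constants)

• $\mathit{eq}(\alpha_1, d)$, $\mathit{eq}(d, \alpha_1)$ (application of a KD chase rule to the two $\mathit{works\_in}$ facts) ■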

It is straightforwardly seen that $\mathit{chase}_\Sigma(D)$ exists if and only if $\mathit{chase}^{eq}_\Sigma(D)$
exists. Clearly, as stated in the following lemma, an isomorphism can be established
between the atoms in $\mathit{chase}^{eq}_\Sigma(D)$ and those in the least Herbrand model of the
program consisting of $D$ plus the rules encoding IDs, KDs, and equality.

Lemma 7 

Consider a database $D$ over a relational schema $\mathcal{R}$ with a set of CDs $\Sigma = \Sigma_I \cup \Sigma_K$,
where $\Sigma_K$ and $\Sigma_I$ are sets of KDs and IDs respectively, such that $\mathit{chase}_\Sigma(D)$ exists.
Let $\Pi$ be the program $\Pi^{\Sigma_I} \cup \Pi^{\Sigma_K} \cup \Pi^{eq} \cup D$ and $M$ its least Herbrand model. Then,
there is an isomorphism $\mu : \Gamma \cup \Gamma_f \rightarrow U_\Pi$, where $U_\Pi$ is the Herbrand universe of
$\Pi$, such that: (i) $\mu(\mathit{chase}^{eq}_\Sigma(D)) = M$; (ii) if $\alpha \in \Gamma_f$ then $\mu(\alpha)$ is a Skolem ground
term in $U_\Pi$.

Proof 

We exhibit the construction of a homomorphism with the desired properties. The
construction will be inductive on the applications of the immediate consequence
operator in the construction of $M$. We start from $D$, and we take the identity
isomorphism mapping $D$ (as a subset of $\mathit{chase}^{eq}_\Sigma(D)$) into $D$ (as a subset of $M$).
Now we consider the following cases of application of the immediate consequence
operator, on different kinds of rules.

(1) Rule in $\Pi^{\Sigma_I}$. Assume we are adding a fact $s(t_s)$ because of a rule $\rho$ of the
form $s(\cdot) \leftarrow r(\cdot)$ encoding a dependency $\sigma$ of the form $r[\cdot] \subseteq s[\cdot]$, where $r(t_r)$ is
a fact in the part $M^*$ of $M$ constructed at a certain point. Since, by induction
hypothesis, $\mu$ (so far) maps a subset of $\mathit{chase}^{eq}_\Sigma(D)$ to $M^*$, we take $\mu^{-1}(r(t_r))$,
which is of the form $r(u_r)$: by application of the ID chase rule on $\sigma$ (encoded by $\rho$),
we get the addition of a fact $s(u_s)$. Now extend $\mu$ by adding to it $\{u_s[i] \rightarrow t_s[i]\}$
for every $i$ such that $u_s[i]$ is a newly introduced fresh constant (or, equivalently,
the corresponding argument in $\rho$'s head contains a Skolem term).

^ Usually, the Herbrand universe is constructed with respect to a language, but often we can talk 
about the Herbrand universe of a logic program, intending the Herbrand universe constructed 
with the constants and function symbols present in that program. The same holds for the notion 
of Herbrand base. 
(2) Rule in $\Pi^{\Sigma_K}$. The construction is the same as above, where the added fact
in $M^*$ is of the form $\mathit{eq}(t_1, t_2)$, with $\{t_1, t_2\} \subseteq U_\Pi$, and the one in $\mathit{chase}^{eq}_\Sigma(D)$ is
of the form $\mathit{eq}(u_1, u_2)$, with $\{u_1, u_2\} \subseteq \Gamma \cup \Gamma_f$.

(3) Rule in $\Pi^{eq}$. It is straightforwardly seen that rules in $\Pi^{eq}$ introduce
equality atoms, whose corresponding atoms in $\mathit{chase}^{eq}_\Sigma(D)$ are introduced by
enforcing reflexivity, symmetry and transitivity of the predicate $\mathit{eq}$, as described in the
construction of $\mathit{chase}^{eq}_\Sigma(D)$. The homomorphism $\mu$ is extended accordingly in an
obvious way.

It is immediate to see that the isomorphism $\mu$ constructed as above is such that
values in $\Gamma_f$ are mapped to Skolem terms (containing function symbols) and vice
versa, and that $\mu(\mathit{chase}^{eq}_\Sigma(D)) = M$. □

The previous lemma shows an isomorphism between the chase with equalities and 
the least Herbrand model of the program comprising the rules for IDs, KDs, equal- 
ities, and the database. Notice that this result holds for general IDs and KDs, and 
not only for CDs: in fact, arbitrary IDs and KDs can be encoded in the same way 
we did for CDs. 

We then use Lemma 7 to extend the notion of level to the atoms of the least
Herbrand model: the level of such an atom is defined as the level of the corresponding
(via the isomorphism) atom in the chase with equalities.

Next, we show that, if we exclude the tuples containing fresh constants, the 
answers to a query over the chase coincide with the answers to the query after 
maquillage over the chase with equalities. 

Lemma 8 

Consider a conjunctive query $q$ over a relational schema $\mathcal{R}$ with a set of CDs
$\Sigma = \Sigma_I \cup \Sigma_K$, where $\Sigma_K$ and $\Sigma_I$ are sets of KDs and IDs respectively, and a database
$D$ for $\mathcal{R}$, such that $\mathit{chase}_\Sigma(D)$ exists. Then the tuples in $q_{eq}^{[\Gamma]}(\mathit{chase}^{eq}_\Sigma(D))$ coincide
with those in $q^{[\Gamma]}(\mathit{chase}_\Sigma(D))$.

Proof 

By construction of $\mathit{chase}^{eq}_\Sigma(D)$, if we eliminate all atoms of the form $\mathit{eq}(\alpha, \beta)$ from
$\mathit{chase}^{eq}_\Sigma(D)$ and replace $\alpha$ with $\beta$ (or $\beta$ with $\alpha$, provided that the replacing one
is the fresh constant that lexicographically comes first), we obtain $\mathit{chase}_\Sigma(D)$. We
call this process equality elimination. Suppose that a tuple $t$ consisting of non-fresh
constants is in $q_{eq}(\mathit{chase}^{eq}_\Sigma(D))$. Then there exists a homomorphism $\mu$ sending
$\mathit{body}(q_{eq})$ to atoms of $\mathit{chase}^{eq}_\Sigma(D)$ and $\mathit{head}(q_{eq})$ to $t$. By applying equality
elimination to $\mu(\mathit{body}(q_{eq}))$ we then obtain atoms in $\mathit{chase}_\Sigma(D)$. These are, in turn,
an image for a homomorphism $\mu'$ from $\mathit{body}(q)$ to atoms of $\mathit{chase}_\Sigma(D)$. This can
be seen as follows. Consider an atom of the form $\mathit{eq}(X, u)$ in $\mathit{body}(q_{eq})$ such that
$\mu(\mathit{eq}(X, u)) = \mathit{eq}(c_1, c_2)$, where $X$ is a variable, $u$ a term, and $c_1, c_2 \in \Gamma \cup \Gamma_f$.
Each time an atom of the form $\mathit{eq}(c_1, c_2)$ is eliminated by equality elimination
from $\mu(\mathit{body}(q_{eq}))$, remove $\mathit{eq}(X, u)$ from $q_{eq}$ and replace in it all occurrences of
the variable $X$ with the term $u$. At each step of the $\mathit{eq}$ elimination process, the
two structures are isomorphic; at the end, $q_{eq}$ is transformed into a variant of $q$
(i.e., the same as $q$ modulo variable renaming), which proves that $q$ is isomorphic
to the result of the equality elimination applied to $\mu(\mathit{body}(q_{eq}))$, i.e., there is the
homomorphism $\mu'$ we were looking for. By construction of $q_{eq}$, if $t$ contains no fresh
constant, then $\mu'$ necessarily maps $\mathit{head}(q)$ to $t$.

For the other inclusion, consider a homomorphism $\mu'$ sending $\mathit{body}(q)$ into atoms
of $\mathit{chase}_\Sigma(D)$ and $\mathit{head}(q)$ into $t$. If the atoms in $\mu'(\mathit{body}(q))$ are in $D$, these are
necessarily also in $\mathit{chase}^{eq}_\Sigma(D)$, so all non-$\mathit{eq}$ atoms in $\mathit{body}(q_{eq})$ can also be mapped
to them by some homomorphism $\mu$; then, the $\mathit{eq}$ atoms require the equality of
constants in $D$, which are necessarily present in $\mathit{chase}^{eq}_\Sigma(D)$. Then $t$ is also an
answer in $q_{eq}(\mathit{chase}^{eq}_\Sigma(D))$. By construction of $\mathit{chase}^{eq}_\Sigma(D)$, for every fact $f$
in $\mathit{chase}_\Sigma(D)$ there is a subset $S$ of $\mathit{chase}^{eq}_\Sigma(D)$, containing only one non-$\mathit{eq}$ fact
$f'$, such that equality elimination on $S$ yields $f$; we say that $f'$ corresponds to $f$.
If some atom in $\mu'(\mathit{body}(q))$ is not in $D$, it may have been generated by an ID
rule or by a KD rule. In the case of an application of an ID rule on a fact $f$ in
the chase, there is a corresponding fact $f' \in \mathit{chase}^{eq}_\Sigma(D)$ on which the same
application is made; note that no tuple merging caused by KD rules in the chase
causes new applications of an ID rule. For a KD rule, in the chase an application
instantiates fresh constants to other constants from two starting tuples; in the chase
with equalities, the new tuple is not generated, but the two starting tuples remain,
and $\mathit{eq}$ atoms are generated for all merged constants. This means that if an atom
in $q$ is mapped into such a merged fact, the corresponding (non-$\mathit{eq}$) atom in $q_{eq}$
can still be mapped into any of the two starting tuples. By construction of $q_{eq}$, the
body of $q_{eq}$ contains one $\mathit{eq}$ atom per term in $q$, so that each such term can be
equated to the replacing constant in the KD rule application (or be left unchanged
by mapping the $\mathit{eq}$ atom to one that equates the term to itself). □

With an argument similar to the one used in the proof of Theorem 2, it can
be shown that, also for the chase with equality, $\delta_M$ levels are sufficient for query
answering. This result is stated below as a corollary of Theorem 2.

Corollary 1 

Let $D$ be a database for a relational schema $\mathcal{R}$, $\Sigma$ a set of CDs over $\mathcal{R}$ such
that $\mathit{chase}_\Sigma(D)$ exists, and $q$ a conjunctive query over $\mathcal{R}$. Then, for every tuple
$t \in q^{[\Gamma]}(\mathit{chase}^{eq}_\Sigma(D))$, there exists a homomorphism $\mu$ sending $\mathit{body}(q)$ to facts of
$\mathit{chase}^{eq}_\Sigma(D)$ and $\mathit{head}(q)$ to $t$ such that all the atoms in $\mu(\mathit{body}(q))$ are in the first
$\delta_M$ levels of $\mathit{chase}^{eq}_\Sigma(D)$, where $\delta_M$ is as in Theorem 2.

Now we can show the main result of this subsection as a consequence of the
previous results. This result validates our encoding of inclusion dependencies, key
dependencies and equalities into $\Pi^{\Sigma_I}$, $\Pi^{\Sigma_K}$, $\Pi^{eq}$ and the query maquillage that
returns $q_{eq}$ from $q$. Indeed, if we put together $\Pi^{\Sigma_I}$, $\Pi^{\Sigma_K}$, $\Pi^{eq}$ and $q_{eq}$ into a program
$\Pi_{q_{eq}}$, and we evaluate it over a set $D$ of ground atoms, discarding the answer tuples
that contain function symbols, we get exactly the certain answers to $q$, evaluated
over $D$ under $\Sigma_I \cup \Sigma_K$.

Theorem 3 

Consider a conjunctive query $q$ over a relational schema $\mathcal{R}$ with a set of CDs
$\Sigma = \Sigma_I \cup \Sigma_K$, where $\Sigma_K$ and $\Sigma_I$ are sets of KDs and IDs respectively, and a
database $D$ for $\mathcal{R}$, such that $\mathit{chase}_\Sigma(D)$ exists. Let $\Pi$ be the set of Horn clauses
$q_{eq} \cup \Pi^{\Sigma_I} \cup \Pi^{\Sigma_K} \cup \Pi^{eq}$ and let $\Pi^{[\Gamma]}_{q_{eq}}(D)$ be the largest function-free subset of $\Pi_{q_{eq}}(D)$.
Then $\Pi^{[\Gamma]}_{q_{eq}}(D) = \mathit{ans}(q, \Sigma, D)$.

Proof 

By Lemma 7 we know that, if we exclude the atoms with predicate $q_{eq}$, the least
Herbrand model $M$ of $\Pi \cup D$ coincides with $\mathit{chase}^{eq}_\Sigma(D)$ modulo an isomorphism
that sends the fresh constants into Skolem terms, and the non-fresh constants into
themselves. Therefore, $\Pi_{q_{eq}}(D)$ coincides with the answers in $q_{eq}(\mathit{chase}^{eq}_\Sigma(D))$,
modulo this isomorphism; moreover, $\Pi^{[\Gamma]}_{q_{eq}}(D)$ coincides with $q_{eq}^{[\Gamma]}(\mathit{chase}^{eq}_\Sigma(D))$,
since, because of the bijection, atoms with fresh constants correspond to atoms
with Skolem terms, and vice versa.

By Lemma 8 we know that $q_{eq}^{[\Gamma]}(\mathit{chase}^{eq}_\Sigma(D)) = q^{[\Gamma]}(\mathit{chase}_\Sigma(D))$.

Finally, Theorem 1 guarantees that $q^{[\Gamma]}(\mathit{chase}_\Sigma(D)) = \mathit{ans}(q, \Sigma, D)$, which
concludes the proof. □

The above result is crucial because it shows the correctness and completeness of 
the encoding of the constraints into logic programming rules. 

In the next subsection we show how to eliminate the function symbols from $\Pi$,
thus obtaining a program expressed in pure Datalog.

5.2 Elimination of function symbols

Now, we want to transform the set of rules $\Pi$ of Theorem 3 into another set of
pure Datalog rules without function symbols. The reason to do so is that in
this way we can take advantage of efficient Datalog engines, whereas evaluating logic
programs with function symbols would be overkill.

To do that, we adopt a strategy inspired by the elimination of function
symbols in the inverse rules algorithm (Duschka and Genesereth 1997) for
answering queries using views. The problem here is more complicated, due to the fact
that function symbols may be arbitrarily nested in the least Herbrand model of
the program. The idea here is to rely on the fact that there is a finite number $\delta_M$
of levels in the chase that is sufficient to answer a query, as stated in Theorem 2.
We shall construct a Datalog program that mimics only the first $\delta_M$ levels of the
chase, so that the function symbols that it needs to take into account are nested
up to $\delta_M$ times. The strategy is based on the "simulation" of facts with function
symbols in the least Herbrand model of $\Pi \cup D$ (where $D$ is an initial incomplete
database) by means of ad-hoc predicates that are annotated so as to represent facts
with function symbols.

Definition 6 (Annotation, annotated predicate, annotated version of an atom)
Let $A$ be an atom of the form $r(t_1, \ldots, t_n)$, where every term $t_i$ is of the form
$f_{i,1}(f_{i,2}(\cdots f_{i,m_i}(\theta_i) \cdots))$, every $f_{i,j}$ is a unary function symbol, and every $\theta_i$ is
either a constant in $\Gamma \cup \Gamma_f$ or a variable. The sequence $\bar{\eta} = \eta_1, \ldots, \eta_n$, with
$\eta_i = f_{i,1}(f_{i,2}(\cdots f_{i,m_i}(\bullet) \cdots))$, is called the annotation of $A$. The new $n$-ary predicate
$r^{\bar{\eta}}$ is called the annotated predicate for $A$, and the function-free atom $r^{\bar{\eta}}(\theta_1, \ldots, \theta_n)$
is called the annotated version of $A$. ■
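
A small sketch may help in reading Definition 6. It is illustrative only: terms are assumed to be either a string (a constant or a variable) or a pair `(f, term)` for the application of a unary function symbol `f`, the placeholder of the annotation is written `*`, and the textual form `r^[...]` of the annotated predicate is a convention chosen here, not the notation of the paper.

```python
def annotate(pred, args):
    """Return (annotated predicate name, function-free argument tuple).

    Each argument is a string or a nested pair (f, term) for f(term).
    """
    annotation, innermost = [], []
    for term in args:
        symbols = []
        # Peel off the function symbols, outermost first.
        while isinstance(term, tuple):
            f, term = term
            symbols.append(f)
        eta = "*"  # placeholder for the innermost constant or variable
        for f in reversed(symbols):
            eta = f"{f}({eta})"
        annotation.append(eta)
        innermost.append(term)
    return f"{pred}^[{', '.join(annotation)}]", tuple(innermost)

# works_in(X, f_s10_2(X)), with f_s10_2 a hypothetical Skolem function symbol
name, flat_args = annotate("works_in", ("X", ("f_s10_2", "X")))
```

The annotated predicate records the nesting of function symbols, while the arguments keep only the innermost constants or variables, so the resulting atom is function-free.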
Example 8

The annotated version of the atom $\mathit{works\_in}(X, f_{\sigma_{10},2}(X))$ occurring in the head of
rule $\sigma_{10}$ in Example 6 is $\mathit{works\_in}^{\bullet, f_{\sigma_{10},2}(\bullet)}(X, X)$. ■

Now, to have a program that yields function-free facts as described above, we
construct suitable rules that make use of annotated predicates. The idea here is that
we want to take control of the nesting of function symbols in the least Herbrand
model of the program, by explicitly using annotated predicates that represent facts
with function symbols; this is possible since we do that only for the (ground) atoms
that mimic facts that are in the first $\delta_M$ levels of the chase of the incomplete
database. Here we make use of the fact, proved in Lemma 7, that the least Herbrand
model of $\Pi^{\Sigma_I} \cup \Pi^{\Sigma_K} \cup \Pi^{eq} \cup D$ coincides with $\mathit{chase}^{eq}_\Sigma(D)$, modulo renaming of
the Skolem terms into fresh constants. Therefore, we are able to transform a (part
of a) chase into the corresponding (part of the) least Herbrand model.

To do so, we construct a "dummy chase", and transform it, in the following way.

Definition 7 (Dummy database, dummy chase, dummy chase rules)
Consider a relational schema $\mathcal{R}$ with a set $\Sigma_I$ of IDs.

(1) Let $B$ be a database for $\mathcal{R}$ consisting of exactly one fact of the form
$r(c_1, \ldots, c_n)$ for every relation $r/n \in \mathcal{R}$, where $c_1, \ldots, c_n$ are distinct constants such
that no constant occurs in more than one fact; $B$ is called the dummy database
for $\mathcal{R}$.

(2) Let $\mathit{chase}^{\delta_M}_{\Sigma_I}(B)$ denote the initial segment of $\mathit{chase}_{\Sigma_I}(B)$ consisting of the
first $\delta_M$ levels; $\mathit{chase}^{\delta_M}_{\Sigma_I}(B)$ is called the dummy chase for $\mathcal{R}$ and $\Sigma_I$.

(3) Let $H$ be as $\mathit{chase}^{\delta_M}_{\Sigma_I}(B)$, but where each fact (possibly containing fresh
constants) is replaced with the corresponding atom (possibly containing function
symbols) in the least Herbrand model of $\Pi^{\Sigma_I} \cup B$; note that such a correspondence
exists by Lemma 7, because without KDs, if we exclude the $\mathit{eq}$ atoms, $\mathit{chase}_{\Sigma_I}(B)$
and $\mathit{chase}^{eq}_{\Sigma_I}(B)$ coincide.

(4) Let $H'$ be as $H$, but where every atom is replaced with its annotated version.

(5) We denote with $\Pi^{H'}$ the set of all rules of the form $A'_2 \leftarrow A'_1$ such that
(a) there is an arc $(A_1, A_2)$ in $H'$ and (b) by replacing every distinct constant with
a distinct variable in $(A_1, A_2)$, we obtain $(A'_1, A'_2)$. The rules in $\Pi^{H'}$ are called
dummy chase rules. ■

Example 9

Consider Example [?]; in the dummy chase we introduce, among others, the fact
$\mathit{employee}(c)$. This fact generates, according to the ID
$\sigma_{10} : \mathit{employee}[1] \subseteq \mathit{works\_in}[1]$, the fact $\mathit{works\_in}(c, f_{\sigma_{10},2}(c))$
(after the transformation of the fresh constants into Skolem terms). Its annotated
version is $\mathit{works\_in}^{\bullet,\,f_{\sigma_{10},2}(\bullet)}(c, c)$. Therefore, $\Pi^{\Sigma_I,A}$ contains, among
others, the rule $\mathit{works\_in}^{\bullet,\,f_{\sigma_{10},2}(\bullet)}(X, X) \leftarrow \mathit{employee}^{\bullet}(X)$. ■
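The construction of Definition 7 can be illustrated with the following Python sketch. It rests on simplifying assumptions that are not in the paper: every ID propagates a single attribute (`src[i] ⊆ dst[j]`), all other positions of the generated fact receive a unary Skolem term, IDs are applied obliviously, and the number of levels is passed explicitly instead of being derived from the bound on chase levels. All function and constant names are hypothetical.

```python
BULLET = "•"


def annotation(term):
    """Term with its innermost constant replaced by a bullet."""
    if isinstance(term, str):
        return BULLET
    fname, arg = term
    return f"{fname}({annotation(arg)})"


def innermost(term):
    """Innermost constant of a (possibly nested) Skolem term."""
    while not isinstance(term, str):
        term = term[1]
    return term


def annotated_atom(pred, args, var_of):
    """Annotated version of an atom, with constants replaced by variables."""
    ann = ",".join(annotation(t) for t in args)
    variables = ",".join(var_of[innermost(t)] for t in args)
    return f"{pred}^[{ann}]({variables})"


def annotated_rule(head, body):
    """Rule head <- body for one arc, each distinct constant mapped to a distinct variable."""
    var_of = {}
    for t in body[1]:
        var_of.setdefault(innermost(t), f"X{len(var_of) + 1}")
    return f"{annotated_atom(*head, var_of)} <- {annotated_atom(*body, var_of)}"


def dummy_chase_rules(arity, ids, levels):
    """ids: tuples (name, src, i, dst, j) standing for src[i] ⊆ dst[j], 1-based."""
    # Dummy database: one fact per relation, no constant shared between facts.
    frontier = [(r, tuple(f"c_{r}_{k}" for k in range(1, n + 1))) for r, n in arity.items()]
    seen = set(frontier)
    rules = set()
    for _ in range(levels):
        new_facts = []
        for pred, args in frontier:
            for name, src, i, dst, j in ids:
                if src != pred:
                    continue
                t = args[i - 1]
                head_args = tuple(
                    t if k == j else (f"f_{name}_{k}", t)
                    for k in range(1, arity[dst] + 1)
                )
                head = (dst, head_args)
                rules.add(annotated_rule(head, (pred, args)))
                if head not in seen:
                    seen.add(head)
                    new_facts.append(head)
        frontier = new_facts
    return rules


print(dummy_chase_rules({"employee": 1, "works_in": 2},
                        [("s10", "employee", 1, "works_in", 1)], levels=3))
```

On the schema fragment of Example 9 the sketch yields the single rule shown there, written in the ASCII form `works_in^[•,f_s10_2(•)](X1,X1) <- employee^[•](X1)`.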

$^{10}$ It does not matter whether they are fresh or non-fresh, since they will disappear at the end of
the process.

The dummy chase determines all possible nesting sequences of function symbols
that may occur in the first $\delta_M$ levels of the least Herbrand model of the program
$\Pi^{\Sigma_I} \cup \Pi^{\Sigma_K} \cup \Pi^{eq} \cup D$: only IDs generate function symbols, and the dummy chase
produces all possible function symbol sequences that may occur for every relation.
We next show how to generate a new annotated, function-free program from
$\Pi^{\Sigma_I} \cup \Pi^{\Sigma_K} \cup \Pi^{eq}$. Preliminarily, we need some notation: we denote with $\bar{X}[h]$ the
h-th term of a sequence $\bar{X}$, and with $\bar{\eta}[h]$ the h-th element of an annotation $\bar{\eta}$
(which is in turn a sequence).

Definition 8 (Function-free rewriting for conceptual dependencies)
Consider a conjunctive query q over a relational schema $\mathcal{R}$ with a set of CDs
$\Sigma = \Sigma_I \cup \Sigma_K$, where $\Sigma_K$ and $\Sigma_I$ are sets of KDs and IDs respectively. Let $\Pi^{A_0}$ be
the set of all rules, called base annotation rules, of the form
$r^{\bullet,\ldots,\bullet}(X_1,\ldots,X_n) \leftarrow r(X_1,\ldots,X_n)$ for every predicate $r \in \mathcal{R} \cup \{eq\}$.

We define $\Pi^{\Sigma,A}$ as the set of rules $\Pi^{\Sigma_I,A} \cup \Pi^{A_0}$ plus all possible rules of the form
$p_0^{\bar{\eta}_0}(\bar{t}_0) \leftarrow p_1^{\bar{\eta}_1}(\bar{t}_1), \ldots, p_k^{\bar{\eta}_k}(\bar{t}_k)$ such that:

1. There is a rule $p_0(\bar{t}_0) \leftarrow p_1(\bar{t}_1), \ldots, p_k(\bar{t}_k)$ in $\Pi^{\Sigma_K} \cup \Pi^{eq} \cup q_{eq}$.

2. Each annotation element $\bar{\eta}_i[j]$ occurs in some rule in $\Pi^{\Sigma_I,A}$.

3. If $\bar{t}_i[j] = \bar{t}_{i'}[j']$ then $\bar{\eta}_i[j] = \bar{\eta}_{i'}[j']$. ■

Base annotation rules are just a convenient renaming that allows us to refer to
the annotation $\bullet,\ldots,\bullet$ to capture also the facts in the database. Note that
$\Pi^{\Sigma_I}$ is not included in the program since it is already encoded in $\Pi^{\Sigma_I,A}$ in a
function-free fashion.

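Example 10

Consider the dependency

$\sigma_{13} : eq(Y_1, Y_2) \leftarrow \mathit{manages}(X_1, Y_1), \mathit{manages}(X_2, Y_2), eq(X_1, X_2)$

encoding the KD $key(\mathit{manages}) = \{1\}$ from Example 2. Among the annotations
occurring in $\Pi^{\Sigma_I,A}$ we have $f_{\sigma_{10},2}(\bullet)$ and $\bullet$ (note that $\bullet$ necessarily does), as shown
in Example [?]. Then $\Pi^{\Sigma,A}$ will include, among others, the following rules, where
$f$ abbreviates $f_{\sigma_{10},2}$:

$eq^{\bullet,\bullet}(Y_1, Y_2) \leftarrow \mathit{manages}^{\bullet,\bullet}(X_1, Y_1), \mathit{manages}^{\bullet,\bullet}(X_2, Y_2), eq^{\bullet,\bullet}(X_1, X_2)$

$eq^{\bullet,\bullet}(Y_1, Y_2) \leftarrow \mathit{manages}^{f(\bullet),\bullet}(X_1, Y_1), \mathit{manages}^{\bullet,\bullet}(X_2, Y_2), eq^{f(\bullet),\bullet}(X_1, X_2)$

$eq^{\bullet,\bullet}(Y_1, Y_2) \leftarrow \mathit{manages}^{\bullet,\bullet}(X_1, Y_1), \mathit{manages}^{f(\bullet),\bullet}(X_2, Y_2), eq^{\bullet,f(\bullet)}(X_1, X_2)$

$eq^{f(\bullet),\bullet}(Y_1, Y_2) \leftarrow \mathit{manages}^{\bullet,f(\bullet)}(X_1, Y_1), \mathit{manages}^{\bullet,\bullet}(X_2, Y_2), eq^{\bullet,\bullet}(X_1, X_2)$

$eq^{\bullet,f(\bullet)}(Y_1, Y_2) \leftarrow \mathit{manages}^{\bullet,\bullet}(X_1, Y_1), \mathit{manages}^{\bullet,f(\bullet)}(X_2, Y_2), eq^{\bullet,\bullet}(X_1, X_2)$

$eq^{\bullet,\bullet}(Y_1, Y_2) \leftarrow \mathit{manages}^{f(\bullet),\bullet}(X_1, Y_1), \mathit{manages}^{f(\bullet),\bullet}(X_2, Y_2), eq^{f(\bullet),f(\bullet)}(X_1, X_2)$

■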
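The rule annotation of Definition 8 can be mimicked by enumerating one annotation element per distinct variable, which enforces condition 3 by construction. The Python sketch below is a hedged illustration: the function name `annotate_rule`, the ASCII rule syntax and the label `f_s10_2` are assumptions, and no attempt is made to discard variants that can never fire.

```python
from itertools import product


def annotate_rule(head, body, annotations):
    """Yield every annotated variant of a rule.

    Atoms are pairs (predicate, list of variables). Each distinct variable
    receives one annotation element, so equal terms get equal annotations.
    """
    variables = []
    for _, args in [head, *body]:
        for v in args:
            if v not in variables:
                variables.append(v)

    for choice in product(annotations, repeat=len(variables)):
        ann = dict(zip(variables, choice))

        def fmt(atom):
            pred, args = atom
            return f"{pred}^[{','.join(ann[v] for v in args)}]({','.join(args)})"

        yield f"{fmt(head)} <- {', '.join(fmt(a) for a in body)}"


# The KD rule of Example 10 with the two annotation elements mentioned there
variants = list(annotate_rule(
    ("eq", ["Y1", "Y2"]),
    [("manages", ["X1", "Y1"]), ("manages", ["X2", "Y2"]), ("eq", ["X1", "X2"])],
    ["•", "f_s10_2(•)"],
))
print(len(variants))
for rule in variants[:3]:
    print(rule)
```

With four distinct variables and two annotation elements, the enumeration produces 2^4 = 16 variants, which include the six rules listed in Example 10.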

Now we can state our central theorem.

Theorem 4

Let D be a database for a relational schema $\mathcal{R}$, $\Sigma$ a set of CDs over $\mathcal{R}$ such
that $\mathit{chase}_{\Sigma}(D)$ exists, and q a conjunctive query over $\mathcal{R}$. Then,
$\Pi^{\Sigma,A}_{q_{eq}^{\bullet,\ldots,\bullet}}(D) = \mathit{ans}(q, \Sigma, D)$.

Proof

The proof is based on the fact that the least Herbrand model M of $\Pi^{\Sigma,A} \cup D$
is a representation of the first $\delta_M$ levels of the least Herbrand model $M_f$ of
$q_{eq} \cup \Pi^{\Sigma_I} \cup \Pi^{\Sigma_K} \cup \Pi^{eq} \cup D$. By Lemma [?], the first $\delta_M$ levels of $M_f$ are
isomorphic with the first $\delta_M$ levels of $\mathit{chase}^{eq}_{\Sigma}(D)$. By Corollary 1, the (non-fresh)
answers to $q_{eq}$ over the first $\delta_M$ levels of $\mathit{chase}^{eq}_{\Sigma}(D)$ coincide with those found
over the whole $\mathit{chase}^{eq}_{\Sigma}(D)$. By Lemma [?], the (non-fresh) answers to $q_{eq}$ over
$\mathit{chase}^{eq}_{\Sigma}(D)$ coincide with the (non-fresh) answers to q over $\mathit{chase}_{\Sigma}(D)$, which, by
Theorem 1, coincide with $\mathit{ans}(q, \Sigma, D)$. Hence, to prove the thesis, we need to show
that there is a correspondence between the facts in M and those in the first $\delta_M$
levels of $M_f$.

We then represent the atoms in M and those in the first $\delta_M$ levels of $M_f$ as two
isomorphic structures. Consider therefore the atoms in $M_f$ as being arranged in
levels (as in the corresponding chase with equalities). Every two atoms corresponding
to an ID rule application are connected by an arc. An eq atom has an incoming arc
for each corresponding atom in the first rule (in $\Pi^{\Sigma_K}$ or $\Pi^{eq}$) that produced it via
the immediate consequence operator. If we exclude eq atoms, $M_f$ is a forest whose
roots are the atoms in D; if we include the eq atoms, we have a directed acyclic
graph, since eq atoms may have several parents. We now show that, for each atom A
of the form $p(\theta_1,\ldots,\theta_n)$ in the first $\delta_M$ levels of $M_f$, there is an atom B of the form
$p^{\eta_1,\ldots,\eta_n}(c_1,\ldots,c_n)$ in M, where each $\eta_i$ is the annotation element corresponding
to $\theta_i$ and $c_i$ its innermost constant. Consider all the ancestors of A in $M_f$.

If p is not the eq predicate, there is a path $A_0, \ldots, A_n = A$ in $M_f$ such that
$A_i$ is at level i and $A_i$ is $A_{i+1}$'s parent. We prove the claim by induction. As
base case, we show that there is an atom $B_0$ in M corresponding to $A_0$ and an
annotation corresponding to $A_0$'s predicate and terms in $\Pi^{\Sigma,A}$; but this is obvious,
since $A_0 \in D$ and all atoms in D are also in M; besides, they also exist in M with
a $\bullet,\ldots,\bullet$ annotation, because of the base annotation rules. As inductive step,
assume the claim holds for all $A_j$ with $j \leq i$ and an annotation corresponding to
$A_i$'s predicate and terms is in $\Pi^{\Sigma,A}$ (let it be $r^{\bar{\eta}}$); we show that it also holds for
$A_{i+1}$. There is an ID that generates $A_{i+1}$ from $A_i$. By inductive hypothesis, since
we are within the first $\delta_M$ levels, there must be a rule in $\Pi^{\Sigma_I,A}$ corresponding to
the ID in question, with an atom with predicate $r^{\bar{\eta}}$ in the body. The application of
the immediate consequence operator on that rule will produce, by construction, an
atom whose predicate annotation matches $A_{i+1}$'s predicate and terms, and whose
constants match $A_{i+1}$'s innermost constants.

If p is eq, the proof is as above, but instead of a single path, there may be multiple
paths of the form $A_0, \ldots, A_n = A$; the above argument can be applied to any of
them. The only difference is that, instead of ID rules, eq atoms are generated either
by KD rules in $\Pi^{\Sigma_K}$ or by the equality rules in $\Pi^{eq}$. For all such rules (and for all
the atoms they are applied to) there are the corresponding annotated counterparts
in $\Pi^{\Sigma,A}$ that have been added by the algorithm for rule annotation.

This proves that, apart from the $q_{eq}$ atoms, all the atoms in the first $\delta_M$ levels
of $M_f$ have a corresponding annotated atom in M. Now, the algorithm for rule
annotation has added to $\Pi^{\Sigma,A}$ all possible versions of $q_{eq}$ in which the head is
annotated $q_{eq}^{\bullet,\ldots,\bullet}$ and the positions in which the same variable occurs in the query
are annotated in the same way, with all possible annotations occurring in the first
$\delta_M$ levels of $M_f$. Therefore the $q_{eq}$ tuples in $M_f$ are contained in the $q_{eq}^{\bullet,\ldots,\bullet}$ tuples
in M.

For the other inclusion, we simply need to arrange the atoms in M according to
levels, as we did for the atoms in $M_f$. Starting from the atoms of D in M and the
eq atoms on constants in D, by the base annotation rules we obtain the same atoms
with annotation $\bullet,\ldots,\bullet$; these annotated atoms are at level 0 in M; the non-annotated
atoms are never used by any other rule in $\Pi^{\Sigma,A}$ and can be disregarded. Every other
rule in $\Pi^{\Sigma,A}$, when used by the immediate consequence operator, generates an atom
(in the head) starting from other atoms (in the body); when the generated atom is
new, we draw an arc from each body atom to the head atom, and give it the level
$\ell + 1$, where $\ell$ is the maximum level of the body atoms. The resulting structure
is again a directed acyclic graph, and from this we can proceed as for the other
inclusion and prove that for each atom in M, a corresponding non-annotated atom
exists in $M_f$, since every rule produced by the algorithm for rule annotation, apart
from the base annotation rules, is a syntactic variant of rules in
$q_{eq} \cup \Pi^{\Sigma_K} \cup \Pi^{eq}$, and the rules in $\Pi^{\Sigma_I,A}$ mimic the rules in $\Pi^{\Sigma_I}$. □

The above theorem suggests our final strategy for computing the answers to a 
conjunctive query q expressed over an EER schema, given a database D. 

(1) We derive a set $\Sigma$ of CDs that represent the EER schema.

(2) We check whether $\mathit{chase}_{\Sigma}(D)$ exists, as described in the proof of Lemma 1, in
time polynomial in |D|.

(3) Then, we derive a Datalog rewriting that computes all certain answers to q,
according to Theorem 4.

(4) Finally, we evaluate the Datalog rewriting on D.
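Step (4) amounts to a standard bottom-up evaluation of a Datalog program. The sketch below is a naive fixpoint evaluator in Python, given only as an illustration: the rule encoding, the convention that variables start with an uppercase letter, and the hand-written toy rewriting (which ignores the eq predicate) are all assumptions and do not reproduce the paper's algorithm.

```python
def is_var(term):
    """Assumed convention: variables start with an uppercase letter."""
    return term[:1].isupper()


def match(atom, fact, subst):
    """Extend subst so that atom matches fact, or return None."""
    (pred, args), (fpred, fargs) = atom, fact
    if pred != fpred or len(args) != len(fargs):
        return None
    subst = dict(subst)
    for a, v in zip(args, fargs):
        if is_var(a):
            if subst.setdefault(a, v) != v:
                return None
        elif a != v:
            return None
    return subst


def evaluate(rules, facts):
    """Naive bottom-up evaluation: apply all rules until no new fact is derived."""
    model = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            substs = [{}]
            for atom in body:
                extended = []
                for s in substs:
                    for fact in model:
                        s2 = match(atom, fact, s)
                        if s2 is not None:
                            extended.append(s2)
                substs = extended
            for s in substs:
                derived = (head[0], tuple(s.get(a, a) for a in head[1]))
                if derived not in model:
                    model.add(derived)
                    changed = True
    return model


F = "works_in^[•,f_s10_2(•)]"
RULES = [
    (("employee^[•]", ("X",)), [("employee", ("X",))]),           # base annotation rule
    (("works_in^[•,•]", ("X", "Y")), [("works_in", ("X", "Y"))]),  # base annotation rule
    ((F, ("X", "X")), [("employee^[•]", ("X",))]),                 # dummy chase rule
    (("q^[•]", ("X",)), [("works_in^[•,•]", ("X", "Y"))]),         # annotated query
    (("q^[•]", ("X",)), [(F, ("X", "Y"))]),                        # annotated query
]
FACTS = {("employee", ("alice",)), ("works_in", ("bob", "d1"))}

answers = {args for pred, args in evaluate(RULES, FACTS) if pred == "q^[•]"}
print(answers)
```

In this toy instance the query asks who works in some department; `alice` is returned only because of the ID from employees to `works_in`, while `bob` comes directly from the stored fact. Production engines would use semi-naive evaluation rather than this naive loop.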

5.3 Considerations on complexity 

We focus here on data complexity, i.e., the complexity w.r.t. the size of the data,
which is the most relevant measure, since the data are usually much larger than
the schema.

Proposition 2

The complexity of computing the certain answers to a CQ over an EER schema is
polynomial in the size of the data if the size $\Delta_D$ of the largest connected part in
the join graph of the instance of the EER schema is bounded.

Proof

Given a CQ q over an EER schema and a database D, we can proceed as follows.
(1) We check whether the chase exists, which can be done in polynomial time in the
size of D by Lemma 1; if it does not, then query answering is trivial (all n-tuples are
in the answer to the query q, where n is the arity of q); (2) we construct a Datalog
rewriting for q, according to what was explained in the previous pages, which does
not depend on D but only on $\Delta_D$, which is assumed to be bounded; (3) we evaluate
the rewriting on the data. Since the evaluation of a Datalog program is polynomial
in data complexity (Dantsin et al. 2001), the thesis follows. □

5.4 Extensions of Results

Dealing with inconsistencies. First of all, as we mentioned in Section 4.2, we
have always assumed that the initial, incomplete database satisfies the KDs derived
from the EER schema. This assumption does not limit the applicability of our
results, since violations of KDs can be treated in different ways. (1) Data cleaning
(see, e.g., (Hernandez and Stolfo 1998)): a preliminary cleaning procedure would
eliminate the KD violations; then, the results from (Cali 2006) ensure that no
violations will occur in the chase, and we can proceed with the techniques presented in
the paper. (2) Strictly sound semantics: according to the sound semantics we have
adopted, from the logical point of view, strictly speaking, a single KD violation in
the initial data makes query answering trivial (any tuple is in the answer, provided
it has the same arity as the query); this extreme assumption, not very usable in
practice, can be encoded in suitable rules that make use of inequalities and that
can be added to our rewritings. We refer the reader to (Cali et al. 2003b) for the
details. (3) Loosely-sound semantics: this assumption is a relaxation of the previous
one, and is reasonable in practice. Inconsistencies are treated in a model-theoretic
way, and suitable Datalog$^{\neg}$ rules (that we can add to our programs without any
trouble, obtaining a correct rewriting under this semantics) encode the reasoning on
the constraints. Again, we refer the reader to (Cali et al. 2003b) for further details.
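As a small illustration of what a preliminary cleaning step has to detect, the following Python sketch reports pairs of facts of one relation that violate a key dependency. It is not the cleaning procedure of the cited works; the function name `kd_violations` and the sample data are hypothetical.

```python
def kd_violations(facts, key):
    """Return pairs of facts that agree on the key positions but differ elsewhere.

    facts: iterable of tuples of one relation; key: 1-based key positions.
    """
    first_seen = {}
    violations = []
    for fact in facts:
        k = tuple(fact[i - 1] for i in key)
        other = first_seen.setdefault(k, fact)
        if other != fact:
            violations.append((other, fact))
    return violations


# key(manages) = {1}: each manager manages at most one department
manages = [("ann", "d1"), ("ann", "d2"), ("bob", "d1")]
print(kd_violations(manages, [1]))
```

Each reported pair shares the key value and conflicts on a non-key attribute, so a cleaning step would have to drop or reconcile one fact of the pair before the chase is built.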

Adding disjointness. Disjointness between two classes, which is a natural
addition to our EER model, can be easily encoded by exclusion dependencies (EDs)
(see, e.g., (Lembo 2004)). The addition of EDs to CDs is not problematic,
provided that we preliminarily compute the closure, w.r.t. the implication, of KDs
and EDs, according to the (sound and complete) implication rules that are found
in (Lembo 2004). After that, we can proceed as in the absence of EDs.

6 Discussion 

Summary of results. In this paper we have employed a conceptual model based on
an extension of the ER model, that we called EER (Extended Entity-Relationship),
and we have given its semantics in terms of the relational database model with
integrity constraints. We have thus carved out a relevant class of relational
constraints, which is a subclass of the well-known key and inclusion dependencies;
such a class is important, because in real-world database design the constraints are
directly derived from an ER schema. In fact, the focus of our contribution is on
querying incomplete data under an interesting class of relational constraints, rather
than on proposing another query language for EER schemata. Moreover, we argue
that our results are independent of the translation from EER to relational.

We have considered conjunctive queries expressed over EER conceptual schemata, 
and we have tackled the problem of providing the certain answers to queries in 
such a setting, when the data are incomplete w.r.t. the constraints that encode the 
conceptual schema. We have characterized a class of relational constraints, namely
conceptual dependencies (CDs), that are able to represent EER schemata. This class
is a subclass of KDs and IDs (in the general case the query answering problem is
undecidable (Cali et al. 2003a)). In this way, we have reduced the query answering
problem under EER constraints into the equivalent problem of query answering
under CDs.

We have provided a query rewriting algorithm that transforms a conjunctive 
query q into a new (recursive Datalog) query that, once evaluated on the incomplete 
data, returns the certain answers to q. 

Finally, we have shown how our results can be extended to more general settings,
in particular: (1) EER schemata with class disjointness; (2) the so-called
loosely-sound semantics for incomplete data, which overcomes the limitations of the
strictly sound one.

Related work. Several works propose query languages for different
flavors of EER schemata (Lawley and Topor 1994; Grant et al. 1993;
Hohenstein and Engels 1992; Thalheim 2000). Our query language, which does not
introduce novel features or characteristics, relies on a standard translation of EER
schemata into relational ones.

As pointed out earlier, query answering in our setting is tightly related
to containment of queries under constraints, which is a fundamental topic
in database theory (Chan 1992; Calvanese et al. 1998; Johnson and Klug 1984;
Kolaitis and Vardi 1998). (Cali et al. 2001) deals with conceptual schemata in the
context of data integration, but the cardinality constraints are more restricted than
in our approach, since they do not include functional participation constraints and
is-a among relationships.

Other works that deal with dependencies similar to those presented here
are (Calvanese et al. 2005; Calvanese et al. 2006), which deal with a formalism
called DL-Lite and based on Description Logic; it is easy to establish a
correspondence between EER entities and DL-Lite concepts, and between EER relationships
and DL-Lite (binary) roles. However, the set of constraints considered in the above
works is not comparable to CDs: while it contains some constructs not expressible in
EER, on the other hand it is unable to represent, for instance, the is-a among
relationships, which we believe is the major source of complexity in the query answering
problem. Also (Ortiz et al. 2006) addresses the problem of query containment using
a formalism for the schema that is more expressive than the one presented here;
the problem is proved to be coNP-hard. In (Calvanese et al. 1998), the authors
address the problem of query containment for queries on schemata expressed in a
formalism that is able to capture our EER model; in this work it is shown that
checking containment is decidable and its complexity is exponential in the number
of variables and constants of $q_1$ and $q_2$, and doubly exponential in the number of
existentially quantified variables that appear in a cycle of the tuple-graph of $q_2$ (we
refer the reader to the paper for further details). Since the complexity is studied by
encoding the problem in a different logic, it is not possible to analyze in detail the
complexity w.r.t. $|q_1|$ and $|q_2|$, which by the technique of (Calvanese et al. 1998)
is in general exponential. If we export the results of (Calvanese et al. 1998) to our
setting, we get an exponential complexity w.r.t. the size of the data for the decision
problem of answering queries over incomplete databases. In our work we provide
a technique that also serves the purpose of computing all answers to a query in the
presence of incomplete data.

Our technique for dealing with the non-repairable violations in the chase
is the same as in (Cali et al. 2003a). This is along the lines of
consistent query answering (Arenas et al. 1999); a similar approach is found
in (Chomicki and Marcinkowski 2005).

Future work. As future work, we plan to extend the EER model with more
constraints which are used in real-world cases, such as covering constraints or more
sophisticated cardinality constraints. We also plan to further investigate the
complexity of query answering, providing a thorough study of complexity, including
lower complexity bounds. Also, we are working on an implementation of the query
rewriting algorithm, so as to test the efficiency of our technique on large data sets.